Preface


Physicists have always approached the world through data and models inspired by this 
data. They build models from data and confront their models with the data generated by 
new experiments or observations. But real data is by nature noisy; until recently, classical 
statistical tools have been successful in dealing with this randomness. The recent emergence 
of very large datasets, together with the computing power to analyze them, has created a 
situation where not only the number of data points is large but also the number of studied 
variables. Classical statistical tools are inadequate to tackle this situation, called the large 
dimension limit (or the Kolmogorov limit). Random matrix theory, and in particular the 
study of large sample covariance matrices, can help make sense of these big datasets, and 
is in fact also becoming a useful tool to understand deep learning. Random matrix theory 
is also linked to many modern problems in statistical physics such as the spectral theory of 
random graphs, interaction matrices of spin-glasses, non-intersecting random walks, many- 
body localization, compressed sensing and many more. 

This book can be considered as one more book on random matrix theory. But our aim 
was to keep it purposely introductory and informal. As an analogy, high school seniors 
and college freshmen are typically taught both calculus and analysis. In analysis one learns 
how to make rigorous proofs, define a limit and a derivative. At the same time in calculus 
one can learn about computing complicated derivatives, multi-dimensional integrals and 
solving differential equations relying only on intuitive definitions (with precise rules) of 
these concepts. This book proposes a “calculus” course for random matrices, based in 
particular on the relatively new concept of “freeness”, that generalizes the standard concept 
of probabilistic independence to non-commuting random variables. 

Rather than make statements about the most general case, concepts are defined with 
some strong hypothesis (e.g. Gaussian entries, real symmetric matrices) in order to simplify 
the computations and favor understanding. Precise notions of norm, topology, convergence, 
exact domain of application are left out, again to favor intuition over rigor. There are many 
good, mathematically rigorous books on the subject (see references below) and the hope is 
that our book will allow the interested reader to read them guided by his/her newly built 
intuition. 


Readership


The book was initially conceived as a textbook for a standard 30-hour graduate-level course
in random matrix theory, for physicists or applied mathematicians, given by one of us (MP)
during a sabbatical at UCLA in 2017-2018. As the book evolved, many new developments,
special topics and applications were included. Lecturers can then customize their
course offering by complementing the first few essential chapters with their own choice 
of chapters or sections from the rest of the book. 

Another group of potential readers are seasoned researchers analyzing large datasets 
who have heard that random matrix theory may help them distinguish signal from noise 
in singular value decompositions or eigenvalues of sample covariance matrices. They have 
heard of the Marčenko–Pastur distribution but do not know how to extend it to more realistic
settings where they might have non-Gaussian noise, true outliers, temporal (sample)
correlations, etc. They need formulas to compute null hypotheses and so forth. They want
to understand where these formulas come from intuitively without requiring full precise 
mathematical proofs. 

The reader is assumed to have a background in undergraduate mathematics taught in 
science and engineering: linear algebra, complex variables and probability theory. Impor- 
tant results from probability theory are recalled in the book (addition of independent vari- 
ables, law of large numbers and central limit theorem, etc.) while stochastic calculus and 
Bayesian estimation are not assumed to be known. Familiarity with physics approximation 
techniques (Taylor expansions, saddle point approximations) is helpful. 


How to Read This Book 


We have tried to make the book accessible for readers of different levels of expertise. The 
bulk of the text is hopefully readable by graduate students, with most calculations laid 
out in detail. We provide exercises in most chapters, which should allow the reader to 
check that he or she has understood the main concepts. We also tried to illustrate the book 
with as many figures as possible, because we strongly believe (as physicists) that pictures
considerably help to form an intuition about the issue at stake.

More technical issues, directed at experts in RMT or statistical physics, are signaled by
the use of a different, smaller font and an extra margin space. Chapters 3, 6, 7 and 13 
are marked with a star, meaning that they are not essential for beginners and they can be 
skipped at first reading. 

At the end of each chapter, we give a non-exhaustive list of references, some general and 
others more technical and specialized, which direct the reader to more in-depth information 
related to the subject treated in the chapter. 


Other Books on Related Subjects 
Books for Mathematicians 


There are many good recent books on random matrix theory and free probabilities written 
by and for mathematicians: Blower [2009], Anderson et al. [2010], Bai and Silverstein 
[2010], Pastur and Scherbina [2010], Tao [2012], Erdős and Yau [2017], Mingo and
Speicher [2017]. These books are often too technical for the intended readership of the
present book. We nevertheless rely on these books to extract some relevant material for our
purpose.


Books for Engineers 


Communications engineers and now financial engineers have become big users of random 
matrix theory and there are at least two books specifically geared towards them. The style 
of engineering books is closer to the style of the present book and these books are quite 
readable for a physics audience. 

There is the short book by Tulino and Verdú [2004]; it gets straight to the point and gives
many useful formulas from free probabilities to compute the spectrum of random matrices. 
The first part of this book covers some of the topics covered here, but many other subjects 
more related to statistical physics and to financial applications are absent. 

Part I of Couillet and Debbah [2011] has a greater overlap with the present book. Again 
about half the topics covered here are not present in that book (e.g. Dyson Brownian 
motion, replica trick, low-rank HCIZ and the estimation of covariance matrices).


Books for Physicists 


Physicists interested in random matrix theory fall into two broad categories: 

Mathematical physicists and high-energy physicists use it to study fundamental quan- 
tum interactions, from Wigner’s distribution of nuclear energy level spacing to models of 
quantum gravity using matrix models of triangulated surfaces. 

Statistical physicists encounter random matrices in the interaction matrices of spin- 
glasses, in the study of non-intersecting random walks, in the spectral analysis of large 
random graphs, in the theory of Anderson localization and many-body localization, and 
finally in the study of sample covariance matrices from large datasets. This book focuses 
primarily on statistical physics and data analysis applications. 

The classical book by Mehta [2004] is at the crossroads of these different approaches, 
whereas Forrester [2010], Brézin and Hikami [2016] and Eynard et al. [2006] are examples 
of books written by mathematical physicists. Livan et al. [2018] is an introductory book 
geared towards statistical physicists. That book is very similar in spirit to ours. The topics 
covered do not overlap entirely; for example Livan et al. do not cover the Dyson Brownian 
motion, the HCIZ integral, the problem of eigenvector overlaps, sample covariance matrices
with general true covariance, free multiplication, etc. 

We should also mention the handbook by Akemann et al. [2011] and the Les Houches 
summer school proceedings [Schehr et al., 2017], in which we co-authored a chapter on 
financial applications of RMT. That book covers a very wide range of topics. It is a useful 
complement to this book but too advanced for most of the intended readers. 

Finally, the present book has some overlap with a review article written with Joel Bun 
[Bun et al., 2017]. 


Classical Random Matrix Theory


Deterministic Matrices


Matrices appear in all corners of science, from mathematics to physics, computer science, 
biology, economics and quantitative finance. In fact, before Schrödinger’s equation, quantum
mechanics was formulated by Heisenberg in terms of what he called “Matrix Mechanics”.
In many cases, the matrices that appear are deterministic, and their properties are
encapsulated in their eigenvalues and eigenvectors. This first chapter gives several elemen- 
tary results in linear algebra, in particular concerning eigenvalues. These results will be 
extremely useful in the rest of the book where we will deal with random matrices, and in 
particular the statistical properties of their eigenvalues and eigenvectors. 


1.1 Matrices, Eigenvalues and Singular Values 
1.1.1 Some Problems Where Matrices Appear 


Let us give three examples motivating the study of matrices, and the different forms that 
those can take. 


Dynamical System 


Consider a generic dynamical system describing the time evolution of a certain 
N-dimensional vector x(t), for example the three-dimensional position of a point in 
space. Let us write the equation of motion as 


$$\frac{dx}{dt} = F(x), \tag{1.1}$$

where F(x) is an arbitrary vector field. Equilibrium points x* are such that F(x*) = 0.
Consider now small deviations from equilibrium, i.e. x = x* + εy where ε ≪ 1. To first
order in ε, the dynamics becomes linear, and given by

$$\frac{dy}{dt} = A y, \tag{1.2}$$

where A is a matrix whose elements are given by $A_{ij} = \partial_j F_i(x^*)$, where i, j are indices
that run from 1 to N. When F can itself be written as the gradient of some potential V, i.e.
$F_i = -\partial_i V(x)$, the matrix A becomes symmetric, i.e. $A_{ij} = A_{ji} = -\partial_{ij} V$. But this is not
always the case; in general the linearized dynamics is described by a matrix A without any
particular property, except that it is a square N × N array of real numbers.


Master Equation 


Another standard setting is the so-called Master equation for the evolution of probabilities.
Call i = 1, ..., N the different possible states of a system and $P_i(t)$ the probability to
find the system in state i at time t. When memory effects can be neglected, the dynamics
is called Markovian and the evolution of $P_i(t)$ is described by the following discrete time
equation:

$$P_i(t+1) = \sum_{j=1}^{N} A_{ij} P_j(t), \tag{1.3}$$

meaning that the system has a probability $A_{ij}$ to jump from state j to state i between t and
t + 1. Note that all elements of A are positive; furthermore, since all jump possibilities must
be exhausted, one must have, for each j, $\sum_i A_{ij} = 1$. This ensures that $\sum_i P_i(t) = 1$ at
all times, since

$$\sum_{i=1}^{N} P_i(t+1) = \sum_{i=1}^{N} \sum_{j=1}^{N} A_{ij} P_j(t) = \sum_{j=1}^{N} \sum_{i=1}^{N} A_{ij} P_j(t) = \sum_{j=1}^{N} P_j(t) = 1. \tag{1.4}$$

Matrices such that all elements are positive and such that the sum over all rows is equal
to unity for each column are called stochastic matrices. In matrix form, Eq. (1.3) reads
P(t + 1) = AP(t), leading to $P(t) = A^t P(0)$, i.e. A raised to the t-th power applied to the
initial distribution.
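
As a minimal numerical sketch of the Master equation (not part of the original text), one can build a small column-stochastic matrix and check that the total probability is conserved. The size N = 5, the random seed and the variable names below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen for reproducibility
N = 5                            # illustrative number of states

# Random matrix with positive entries, normalized so that each column sums to 1
A = rng.random((N, N)) + 0.1
A /= A.sum(axis=0, keepdims=True)

# Arbitrary normalized initial distribution P(0)
P0 = rng.random(N)
P0 /= P0.sum()

# Iterate Eq. (1.3): P(t+1) = A P(t)
steps = 20
P = P0.copy()
for _ in range(steps):
    P = A @ P

# The same state obtained from the t-th matrix power applied to P(0)
P_power = np.linalg.matrix_power(A, steps) @ P0

print("total probability after", steps, "steps:", P.sum())
print("iteration agrees with matrix power:", np.allclose(P, P_power))
```

The normalization of the columns of A is what guarantees conservation of probability; normalizing the rows instead would not, in general.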


Covariance Matrices 


As a third important example, let us consider random, N-dimensional real vectors X, with 
some given multivariate distribution P (X). The covariance matrix C of the X’s is defined as 


$$C_{ij} = \mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\,\mathbb{E}[X_j], \tag{1.5}$$


where E means that we are averaging over the distribution P(X). Clearly, the matrix C is 
real and symmetric. It is also positive semi-definite, in the sense that for any vector x, 


$$x^T C x \geq 0. \tag{1.6}$$


If it were not the case, it would be possible to find a linear combination of the components
of X with a negative variance, which is obviously impossible.
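
The positive semi-definiteness argument can be illustrated numerically. The sketch below (an illustration under assumed parameters, not the authors' code) estimates C from synthetic samples and checks that $x^T C x$ coincides with the variance of the linear combination $x^T X$. The sample size T, the mixing matrix L and the seed are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
N, T = 4, 10_000                 # dimension and number of samples (illustrative)

# Draw T samples of an N-dimensional vector with correlated components
L = rng.normal(size=(N, N))      # hypothetical mixing matrix
X = rng.normal(size=(T, N)) @ L.T

# Empirical version of Eq. (1.5): C_ij = E[X_i X_j] - E[X_i] E[X_j]
mean = X.mean(axis=0)
C = (X.T @ X) / T - np.outer(mean, mean)

# Eigenvalues of a real symmetric matrix
eigvals = np.linalg.eigvalsh(C)

# Variance of the linear combination x^T X, compared with x^T C x
x = rng.normal(size=N)
quad_form = x @ C @ x
variance = np.var(X @ x)

print("smallest eigenvalue of C:", eigvals.min())
print("x^T C x:", quad_form, " variance of x^T X:", variance)
```

Since a variance cannot be negative, the quadratic form is non-negative for every x, which is the content of Eq. (1.6).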

The three examples above are all such that the corresponding matrices are N x N square 
matrices. Examples where matrices are rectangular also abound. For example, one could 
consider two sets of random real vectors: X of dimension $N_1$ and Y of dimension $N_2$. The
cross-covariance matrix defined as

$$C_{ia} = \mathbb{E}[X_i Y_a] - \mathbb{E}[X_i]\,\mathbb{E}[Y_a]; \qquad i = 1, \dots, N_1; \quad a = 1, \dots, N_2, \tag{1.7}$$

is an $N_1 \times N_2$ matrix that describes the correlations between the two sets of vectors.


1.1.2 Eigenvalues and Eigenvectors 


One learns a great deal about matrices by studying their eigenvalues and eigenvectors. For 
a square matrix A a pair of scalar and non-zero vector (λ, v) satisfying

$$A v = \lambda v \tag{1.8}$$

is called an eigenvalue–eigenvector pair.

Trivially if v is an eigenvector, αv is also an eigenvector when α is a non-zero real
number. Sometimes multiple non-collinear eigenvectors share the same eigenvalue; we say
that this eigenvalue is degenerate and has multiplicity equal to the dimension of the vector
space spanned by its eigenvectors.

If Eq. (1.8) is true, it implies that the equation $(A - \lambda \mathbf{1}) v = 0$ has non-trivial solutions,
which requires that $\det(\lambda \mathbf{1} - A) = 0$. The eigenvalues λ are thus the roots of the so-called
characteristic polynomial of the matrix A, obtained by expanding $\det(\lambda \mathbf{1} - A)$. Clearly, this
polynomial¹ is of order N and therefore has at most N different roots, which correspond
to the (possibly complex) eigenvalues of A. Note that the characteristic polynomial of $A^T$
coincides with the characteristic polynomial of A, so the eigenvalues of A and $A^T$ are
identical.

Now, let $\lambda_1, \lambda_2, \dots, \lambda_N$ be the N eigenvalues of A with $v_1, v_2, \dots, v_N$ the corresponding
eigenvectors. We define Λ as the N × N diagonal matrix with $\lambda_i$ on the diagonal, and V
as the N × N matrix whose jth column is $v_j$, i.e. $V_{ij} = (v_j)_i$ is the ith component of $v_j$.
Then, by definition,

$$A V = V \Lambda, \tag{1.9}$$

since once expanded, this reads

$$\sum_k A_{ik} V_{kj} = V_{ij} \lambda_j, \tag{1.10}$$

or $A v_j = \lambda_j v_j$. If the eigenvectors are linearly independent (which is not true for all
matrices), the matrix inverse $V^{-1}$ exists and one can therefore write A as

$$A = V \Lambda V^{-1}, \tag{1.11}$$

which is called the eigenvalue decomposition of the matrix A.

Symmetric matrices (such that $A = A^T$) have very nice properties regarding their
eigenvalues and eigenvectors.

¹ The characteristic polynomial $Q_N(\lambda) = \det(\lambda \mathbf{1} - A)$ always has a coefficient 1 in front of its highest power
($Q_N(\lambda) = \lambda^N + O(\lambda^{N-1})$); such polynomials are called monic.

• They have exactly N eigenvalues when counted with their multiplicity.
• All their eigenvalues and eigenvectors are real.
• Their eigenvectors are orthogonal and can be chosen to be orthonormal (i.e. $v_i^T v_j = \delta_{ij}$).
  Here we assume that for degenerate eigenvalues we pick an orthogonal set of
  corresponding eigenvectors.

If we choose orthonormal eigenvectors, the matrix V has the property $V^T V = \mathbf{1}$ ($\Rightarrow V^T = V^{-1}$).
Hence it is an orthogonal matrix V = O and Eq. (1.11) reads

$$A = O \Lambda O^T, \tag{1.12}$$

where Λ is a diagonal matrix containing the eigenvalues associated with the eigenvectors
in the columns of O. A symmetric matrix can be diagonalized by an orthogonal matrix.
Note that an N × N orthogonal matrix is fully parameterized by N(N − 1)/2 “angles”,
whereas Λ contains N diagonal elements. So the total number of parameters of the diagonal
decomposition is N(N − 1)/2 + N, which is identical, as it should be, to the number of
different elements of a symmetric N × N matrix.
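
The decomposition of Eq. (1.12) and the parameter count can be checked on a small example. This is a sketch assuming NumPy's `numpy.linalg.eigh` routine for symmetric matrices; the matrix size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed
N = 6                            # illustrative size

H = rng.normal(size=(N, N))
A = (H + H.T) / 2                # real symmetric matrix

# eigh returns real eigenvalues (ascending) and orthonormal eigenvectors as columns
lam, O = np.linalg.eigh(A)
Lam = np.diag(lam)

orthogonal = np.allclose(O.T @ O, np.eye(N))      # O^T O = 1
reconstructed = np.allclose(O @ Lam @ O.T, A)     # A = O Lam O^T, Eq. (1.12)

# Parameter count: N(N-1)/2 angles plus N eigenvalues
n_params = N * (N - 1) // 2 + N
n_elements = N * (N + 1) // 2                     # independent entries of a symmetric matrix

print(orthogonal, reconstructed, n_params, n_elements)
```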

Let us come back to our dynamical system example, Eq. (1.2). One basic question is to 
know whether the perturbation y will grow with time, or decay with time. The answer to 
this question is readily given by the eigenvalues of A. For simplicity, we assume F to be 
a gradient such that A is symmetric. Since the eigenvectors of A are orthonormal, one can 
decompose y in terms of the v’s as

$$y(t) = \sum_{i=1}^{N} c_i(t)\, v_i. \tag{1.13}$$

Taking the dot product of Eq. (1.2) with $v_i$ then shows that the dynamics of the coefficients
$c_i(t)$ are decoupled and given by

$$\frac{dc_i}{dt} = \lambda_i c_i, \tag{1.14}$$

where $\lambda_i$ is the eigenvalue associated with $v_i$. Therefore, any component of the initial
perturbation y(t = 0) that is along an eigenvector with positive eigenvalue will grow exponentially
with time, until the linearized approximation leading to Eq. (1.2) breaks down.
Conversely, components along directions with negative eigenvalues decrease exponentially
with time. An equilibrium x* is called stable provided all eigenvalues are negative, and
marginally stable if some eigenvalues are zero while all others are negative.
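
The decoupling of Eqs. (1.13) and (1.14) can be compared with a direct integration of the linearized dynamics. The sketch below is an illustration under assumed values: the symmetric matrix A is random rather than derived from a specific potential, and the Runge-Kutta integrator `y_rk4` is a helper introduced here only for the comparison.

```python
import numpy as np

rng = np.random.default_rng(3)       # arbitrary seed
N = 4
H = rng.normal(size=(N, N))
A = (H + H.T) / 2                    # symmetric, as for a gradient field F = -grad V

lam, V = np.linalg.eigh(A)           # columns of V are the orthonormal v_i
y0 = 1e-3 * rng.normal(size=N)       # small initial perturbation y(0)
c0 = V.T @ y0                        # coefficients c_i(0) = v_i . y(0)

def y_decoupled(t):
    # Eq. (1.14) gives c_i(t) = c_i(0) exp(lambda_i t); Eq. (1.13) rebuilds y(t)
    return V @ (c0 * np.exp(lam * t))

def y_rk4(t, steps=2000):
    # Direct integration of dy/dt = A y with a classical fourth-order Runge-Kutta scheme
    y, dt = y0.copy(), t / steps
    for _ in range(steps):
        k1 = A @ y
        k2 = A @ (y + 0.5 * dt * k1)
        k3 = A @ (y + 0.5 * dt * k2)
        k4 = A @ (y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

# The equilibrium is stable only if every eigenvalue is negative
stable = bool(np.all(lam < 0))

print("eigenvalues:", lam)
print("stable equilibrium:", stable)
print("decoupled solution matches integration:",
      np.allclose(y_decoupled(1.0), y_rk4(1.0), rtol=1e-6, atol=1e-12))
```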

The important message carried by the example above is that diagonalizing a matrix 
amounts to finding a way to decouple the different degrees of freedom, and convert a matrix 
equation into a set of N scalar equations, as in Eq. (1.14). We will see later that the same
idea holds for covariance matrices as well: their diagonalization allows one to find a set of 
uncorrelated vectors. This is usually called Principal Component Analysis (PCA). 


Exercise 1.1.1 Instability of eigenvalues of non-symmetric matrices
Consider the N × N square band diagonal matrix $M_0$ defined by $[M_0]_{ij} = 2\,\delta_{i,j-1}$:

$$M_0 = \begin{pmatrix} 0 & 2 & 0 & \cdots & 0 \\ 0 & 0 & 2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 2 \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}. \tag{1.15}$$

(a) Show that $M_0^N = 0$ and so all the eigenvalues of $M_0$ must be zero.
    Use a numerical eigenvalue solver for non-symmetric matrices and confirm
    numerically that this is the case for N = 100.

(b) If O is an orthogonal matrix ($O O^T = \mathbf{1}$), $O M_0 O^T$ has the same eigenvalues
    as $M_0$. Following Exercise 1.2.4, generate a random orthogonal matrix O.
    Numerically find the eigenvalues of $O M_0 O^T$. Do you get the same answer as
    in (a)?

(c) Consider $M_1$ whose elements are all equal to those of $M_0$ except for one
    element in the lower left corner, $[M_1]_{N,1} = (1/2)^{N-1}$. Show that $M_1^N = \mathbf{1}$;
    more precisely, show that the characteristic polynomial of $M_1$ is given by
    $\det(\lambda \mathbf{1} - M_1) = \lambda^N - 1$, therefore $M_1$ has N distinct eigenvalues equal to the
    N complex roots of unity $\lambda_k = e^{2\pi i k/N}$.

(d) For N greater than about 60, $O M_0 O^T$ and $O M_1 O^T$ are indistinguishable to
    machine precision. Compare numerically the eigenvalues of these two rotated
    matrices.


1.1.3 Singular Values 


A non-symmetric, square matrix cannot in general be decomposed as $A = O \Lambda O^T$, where
Λ is a diagonal matrix and O an orthogonal matrix. One can however find a very useful
alternative decomposition as

$$A = V S U^T, \tag{1.16}$$

where S is a non-negative diagonal matrix, whose elements are called the singular values
of A, and U, V are two real, orthogonal matrices. Whenever A is symmetric positive semi-definite,
one has S = Λ and U = V.

Equation (1.16) also holds for rectangular N × T matrices, where V is N × N orthogonal,
U is T × T orthogonal and S is N × T diagonal as defined below. To construct the
singular value decomposition (SVD) of A, we first introduce two matrices B and $\widehat{B}$, defined
as $B := A A^T$ and $\widehat{B} := A^T A$. It is plain to see that these matrices are symmetric, since
$B^T = (A A^T)^T = (A^T)^T A^T = B$ (and similarly for $\widehat{B}$). They are also positive semi-definite as
for any vector x we have $x^T B x = \|A^T x\|^2 \geq 0$.

We can show that B and $\widehat{B}$ have the same non-zero eigenvalues. In fact, let λ > 0 be an
eigenvalue of B and v ≠ 0 the corresponding eigenvector. Then we have, by definition,

$$A A^T v = \lambda v. \tag{1.17}$$

Let $u = A^T v$, then we can get from the above equation that

$$A^T A A^T v = \lambda A^T v \;\Rightarrow\; \widehat{B} u = \lambda u. \tag{1.18}$$

Moreover,

$$\|u\|^2 = v^T A A^T v = v^T B v \neq 0 \;\Rightarrow\; u \neq 0. \tag{1.19}$$

Hence λ is also an eigenvalue of $\widehat{B}$. Note that for degenerate eigenvalues λ of B, an
orthogonal set of corresponding eigenvectors $\{v_\ell\}$ gives rise to an orthogonal set $\{A^T v_\ell\}$
of eigenvectors of $\widehat{B}$. Hence the multiplicity of λ in $\widehat{B}$ is at least that in B. Similarly, we can
show that any non-zero eigenvalue of $\widehat{B}$ is also an eigenvalue of B. This finishes the proof
of the claim.

Note that B has at most N non-zero eigenvalues and $\widehat{B}$ has at most T non-zero eigenvalues.
Thus by the above claim, if T > N, $\widehat{B}$ has at least T − N zero eigenvalues, and if
T < N, B has at least N − T zero eigenvalues. We denote the other min{N, T} eigenvalues
of B and $\widehat{B}$ by $\{\lambda_k\}_{1 \leq k \leq \min\{N,T\}}$. Then the SVD of A is expressed as Eq. (1.16), where V is
the N × N orthogonal matrix consisting of the N normalized eigenvectors of B, U is the
T × T orthogonal matrix consisting of the T normalized eigenvectors of $\widehat{B}$, and S is an
N × T rectangular diagonal matrix with $S_{kk} = \sqrt{\lambda_k} \geq 0$, 1 ≤ k ≤ min{N, T} and all other
entries equal to zero.

For instance, if N < T, we have

$$S = \begin{pmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_N} & 0 & \cdots & 0 \end{pmatrix}. \tag{1.20}$$

Although (non-degenerate) normalized eigenvectors are unique up to a sign, the choice of
the positive sign for the square-root $\sqrt{\lambda_k}$ imposes a condition on the combined sign for the
left and right singular vectors $v_k$ and $u_k$. In other words, simultaneously changing both $v_k$
and $u_k$ to $-v_k$ and $-u_k$ leaves the matrix A invariant, but for non-zero singular values one
cannot individually change the sign of either $v_k$ or $u_k$.
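
The construction above can be illustrated with a small rectangular matrix. In this sketch (assumed sizes N = 3, T = 5 and an arbitrary seed), `numpy.linalg.svd` returns factors in the convention A = U_np diag(s) Vh_np, so in the notation of Eq. (1.16) the matrix V is NumPy's first factor and $U^T$ is its last factor.

```python
import numpy as np

rng = np.random.default_rng(4)   # arbitrary seed
N, T = 3, 5                      # illustrative sizes with N < T
A = rng.normal(size=(N, T))

B = A @ A.T                      # N x N
B_hat = A.T @ A                  # T x T

# Eigenvalues in decreasing order
lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]
lam_B_hat = np.sort(np.linalg.eigvalsh(B_hat))[::-1]

# NumPy convention: A = V @ S @ Ut, with V (N x N), Ut = U^T (T x T), s decreasing
V, s, Ut = np.linalg.svd(A, full_matrices=True)

S = np.zeros((N, T))
S[:N, :N] = np.diag(s)           # rectangular diagonal matrix, valid because N <= T
A_rebuilt = V @ S @ Ut

# Flipping the sign of both v_1 and u_1 leaves A unchanged; flipping only v_1 does not
V_flipped, Ut_flipped = V.copy(), Ut.copy()
V_flipped[:, 0] *= -1
Ut_flipped[0, :] *= -1
same_when_both_flipped = np.allclose(V_flipped @ S @ Ut_flipped, A)
changed_when_one_flipped = not np.allclose(V_flipped @ S @ Ut, A)

print("nonzero eigenvalues of B:    ", lam_B)
print("largest eigenvalues of B_hat:", lam_B_hat[:N])
print("squared singular values:     ", s**2)
```

The remaining T − N eigenvalues of $A^T A$ vanish up to rounding errors, in line with the counting argument in the text.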

The recipe to find the SVD, Eq. (1.16), is thus to diagonalize both $A A^T$ (to obtain V
and $S^2$) and $A^T A$ (to obtain U and again $S^2$). It is insightful to again count the number of
parameters involved in this decomposition. Consider a general N × T matrix with T > N
(the case N > T follows similarly). The N eigenvectors of $A A^T$ are generically unique
up to a sign, while for T − N > 0 the matrix $A^T A$ will have a degenerate eigenspace
associated with the eigenvalue 0 of size T − N, hence its eigenvectors are only unique up
to an arbitrary rotation in T − N dimensions. So generically the SVD amounts
to writing the NT elements of A as

$$NT = \frac{1}{2} N(N-1) + N + \frac{1}{2} T(T-1) - \frac{1}{2}(T-N)(T-N-1). \tag{1.21}$$


The interpretation of Eq. (1.16) for N × N matrices is that one can always find an
orthonormal basis of vectors {u} such that the application of a matrix A amounts to a
rotation (or an improper rotation) of {u} into another orthonormal set {v}, followed by a
dilation of each $v_k$ by a positive factor $\sqrt{\lambda_k}$.

Normal matrices are such that U = V. In other words, A is normal whenever A commutes
with its transpose: $A A^T = A^T A$. Symmetric, skew-symmetric and orthogonal matrices
are normal, but other cases are possible. For example a 3 × 3 matrix such that each row
and each column has exactly two elements equal to 1 and one element equal to 0 is normal.
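
The 3 × 3 example can be verified directly. The particular matrix below is one instance chosen for illustration; any matrix with two 1s and one 0 in each row and column should behave the same way.

```python
import numpy as np

# One matrix with exactly two 1s and one 0 in every row and every column
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])

is_normal = np.array_equal(A @ A.T, A.T @ A)    # A commutes with its transpose
is_symmetric = np.array_equal(A, A.T)
is_skew = np.array_equal(A, -A.T)
is_orthogonal = np.array_equal(A @ A.T, np.eye(3, dtype=int))

print(is_normal, is_symmetric, is_skew, is_orthogonal)
```

This matrix is normal without being symmetric, skew-symmetric or orthogonal.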


1.2 Some Useful Theorems and Identities 


In this section, we state without proof very useful theorems on eigenvalues and matrices. 


1.2.1 Gershgorin Circle Theorem 


Let A be a real matrix, with elements $A_{ij}$. Define $R_i$ as $R_i = \sum_{j \neq i} |A_{ij}|$, and $D_i$ a disk in
the complex plane centered on $A_{ii}$ and of radius $R_i$. Then every eigenvalue of A lies within
at least one disk $D_i$. For example, for the matrix

$$A = \begin{pmatrix} 1 & -0.2 & 0.2 \\ -0.3 & 2 & -0.2 \\ 0 & 1.1 & 3 \end{pmatrix}, \tag{1.22}$$

the three circles are located on the real axis at x = 1, 2 and 3 with radii 0.4, 0.5 and 1.1
respectively (see Fig. 1.1).

In particular, eigenvalues corresponding to eigenvectors with a maximum amplitude on
i lie within the disk $D_i$.
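
The example of Eq. (1.22) is small enough to check numerically. The following sketch computes the disk centers and radii and tests which disks contain each eigenvalue; the array name `in_disk` is introduced here for illustration.

```python
import numpy as np

# Matrix of Eq. (1.22)
A = np.array([[ 1.0, -0.2,  0.2],
              [-0.3,  2.0, -0.2],
              [ 0.0,  1.1,  3.0]])

centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)   # R_i = sum over j != i of |A_ij|

eigvals = np.linalg.eigvals(A)

# in_disk[k, i] is True when eigenvalue k lies inside the disk D_i
in_disk = np.abs(eigvals[:, None] - centers[None, :]) <= radii[None, :]

print("centers:", centers)
print("radii:  ", radii)
print("eigenvalues:", eigvals)
print("disk membership (rows: eigenvalues, columns: disks):")
print(in_disk)
```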


1.2.2 The Perron–Frobenius Theorem


Let A be a real matrix, with all its elements positive, $A_{ij} > 0$. Then the top eigenvalue $\lambda_{\max}$
is unique and real (all other eigenvalues have a smaller real part). The corresponding top
eigenvector v* has all its elements positive:

$$A v^* = \lambda_{\max} v^*; \qquad v_k^* > 0, \ \forall k. \tag{1.23}$$

The top eigenvalue satisfies the following inequalities:

$$\min_i \sum_j A_{ij} \leq \lambda_{\max} \leq \max_i \sum_j A_{ij}. \tag{1.24}$$

[Figure 1.1: plot in the complex plane, with Re(λ) on the horizontal axis and Im(λ) on the vertical axis.]

Figure 1.1 The three complex eigenvalues of the matrix (1.22) (crosses) and its three Gershgorin
circles. The first eigenvalue $\lambda_1 \approx 0.92$ falls in the first circle while the other two, $\lambda_{2,3} \approx 2.54 \pm 0.18i$,
fall in the third one.


Application: Suppose A is a stochastic matrix, such that all its elements are positive
and satisfy $\sum_i A_{ij} = 1$, ∀j. Then clearly the vector $\mathbf{1}$ is an eigenvector of $A^T$, with
eigenvalue λ = 1. But since the Perron–Frobenius theorem can be applied to $A^T$, the inequalities
(1.24) ensure that λ is the top eigenvalue of $A^T$, and thus also of A. All the elements of the
corresponding eigenvector v* are positive, and describe the stationary state of the associated
Master equation, i.e.

$$P_i^* = \sum_j A_{ij} P_j^* \quad \text{with} \quad P_i^* = \frac{v_i^*}{\sum_k v_k^*}. \tag{1.25}$$
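
The application to stochastic matrices can be sketched numerically as follows. The matrix is random with positive entries (an assumption made for illustration), and the stationary state is obtained by normalizing the top eigenvector returned by `numpy.linalg.eig`.

```python
import numpy as np

rng = np.random.default_rng(5)   # arbitrary seed
N = 6
A = rng.random((N, N)) + 0.05
A /= A.sum(axis=0, keepdims=True)        # column-stochastic, all entries positive

eigvals, eigvecs = np.linalg.eig(A)
top = np.argmax(eigvals.real)            # index of the eigenvalue with largest real part
lam_max = eigvals[top].real

v_star = eigvecs[:, top].real
P_star = v_star / v_star.sum()           # fixes the overall sign and the normalization

# Bounds of Eq. (1.24) use the row sums of A
row_sums = A.sum(axis=1)

print("top eigenvalue:", lam_max)
print("stationary state:", P_star)
print("bounds:", row_sums.min(), "<=", lam_max, "<=", row_sums.max())
```

Dividing by the sum of the components removes the arbitrary sign of the numerical eigenvector, so that `P_star` can be read as a probability distribution.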


Exercise 1.2.1 Gershgorin and Perron–Frobenius
Show that the upper bound in Eq. (1.24) is a simple consequence of the 
Gershgorin theorem. 


1.2.3 The Eigenvalue Interlacing Theorem 


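Let A be an N × N symmetric matrix (or more generally Hermitian matrix) with eigenvalues
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N$. Consider the (N − 1) × (N − 1) submatrix $A_{\backslash i}$ obtained by removing the ith
row and ith column of A. Its eigenvalues are $\mu_1^{(i)} \geq \mu_2^{(i)} \geq \cdots \geq \mu_{N-1}^{(i)}$. Then the following
interlacing inequalities hold:

$$\lambda_1 \geq \mu_1^{(i)} \geq \lambda_2 \geq \cdots \geq \lambda_{N-1} \geq \mu_{N-1}^{(i)} \geq \lambda_N. \tag{1.26}$$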
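
A quick numerical check of the interlacing inequalities on a random symmetric matrix is sketched below. The size, the removed index i and the seed are arbitrary, and indices are zero-based as usual in Python.

```python
import numpy as np

rng = np.random.default_rng(6)   # arbitrary seed
N, i = 6, 2                      # illustrative size and (zero-based) removed index
H = rng.normal(size=(N, N))
A = (H + H.T) / 2

lam = np.linalg.eigvalsh(A)[::-1]                 # lambda_1 >= ... >= lambda_N

# Remove the ith row and ith column
A_minor = np.delete(np.delete(A, i, axis=0), i, axis=1)
mu = np.linalg.eigvalsh(A_minor)[::-1]            # mu_1 >= ... >= mu_{N-1}

# Eq. (1.26): lambda_k >= mu_k >= lambda_{k+1} for k = 1, ..., N-1
interlaced = bool(np.all(lam[:-1] >= mu) and np.all(mu >= lam[1:]))

print("lambda:", lam)
print("mu:    ", mu)
print("interlaced:", interlaced)
```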

Very recently, a formula relating eigenvectors to eigenvalues was (re-)discovered. Calling
$v_i$ the eigenvector of A associated with $\lambda_i$, one has²

$$(v_i)_j^2 = \frac{\prod_{k=1}^{N-1} \big( \lambda_i - \mu_k^{(j)} \big)}{\prod_{\ell=1,\, \ell \neq i}^{N} ( \lambda_i - \lambda_\ell )}. \tag{1.27}$$
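
The identity (1.27) can be tested on a random symmetric matrix with non-degenerate eigenvalues. This is a sketch with arbitrary size and seed; the arrays `lhs` and `rhs` are names introduced here for the two sides of the identity.

```python
import numpy as np

rng = np.random.default_rng(7)   # arbitrary seed
N = 5
H = rng.normal(size=(N, N))
A = (H + H.T) / 2

lam, V = np.linalg.eigh(A)       # V[j, i] is the jth component of eigenvector v_i

lhs = V**2                       # lhs[j, i] = (v_i)_j^2
rhs = np.empty((N, N))
for j in range(N):
    # Eigenvalues mu^(j) of the minor obtained by removing row j and column j
    minor = np.delete(np.delete(A, j, axis=0), j, axis=1)
    mu = np.linalg.eigvalsh(minor)
    for i in range(N):
        numerator = np.prod(lam[i] - mu)
        denominator = np.prod(lam[i] - np.delete(lam, i))
        rhs[j, i] = numerator / denominator

print("largest deviation between both sides:", np.abs(lhs - rhs).max())
```

The identity only gives the squared components, so the signs of the eigenvector entries are not recovered this way.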


1.2.4 Sherman–Morrison Formula


The Sherman–Morrison formula gives the inverse of a matrix A perturbed by a rank-1
perturbation:

$$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, \tag{1.28}$$

valid for any invertible matrix A and vectors u and v such that the denominator does not
vanish. This is a special case of the Woodbury identity, which reads

$$(A + U C V^T)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V^T A^{-1} U \right)^{-1} V^T A^{-1}, \tag{1.29}$$

where U, V are N × K matrices and C is a K × K matrix. Equation (1.28) corresponds to
the case K = 1.

The associated Sherman–Morrison determinant lemma reads

$$\det(A + v u^T) = \det A \cdot \left( 1 + u^T A^{-1} v \right) \tag{1.30}$$

for invertible A.
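
The three identities above lend themselves to a direct numerical check. In this sketch the matrix A is shifted by a multiple of the identity only to keep it comfortably invertible; that shift, the sizes N and K and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)   # arbitrary seed
N, K = 5, 2
A = rng.normal(size=(N, N)) + N * np.eye(N)   # shifted to stay well conditioned
u = rng.normal(size=N)
v = rng.normal(size=N)
A_inv = np.linalg.inv(A)

# Eq. (1.28): rank-1 update of the inverse
denom = 1.0 + v @ A_inv @ u
sm_inverse = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom
direct_inverse = np.linalg.inv(A + np.outer(u, v))

# Eq. (1.29): Woodbury identity with N x K matrices Uk, Vk and a K x K matrix C
Uk = rng.normal(size=(N, K))
Vk = rng.normal(size=(N, K))
C = rng.normal(size=(K, K)) + K * np.eye(K)
middle = np.linalg.inv(np.linalg.inv(C) + Vk.T @ A_inv @ Uk)
woodbury = A_inv - A_inv @ Uk @ middle @ Vk.T @ A_inv
woodbury_direct = np.linalg.inv(A + Uk @ C @ Vk.T)

# Eq. (1.30): determinant lemma
det_lemma = np.linalg.det(A) * (1.0 + u @ A_inv @ v)
det_direct = np.linalg.det(A + np.outer(v, u))

print(np.allclose(sm_inverse, direct_inverse),
      np.allclose(woodbury, woodbury_direct),
      np.isclose(det_lemma, det_direct))
```

When the inverse of A is already known, Eq. (1.28) only requires matrix-vector products and one outer product, which is why it is often preferred to a fresh inversion.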


Exercise 1.2.2 Sherman–Morrison
Show that Eq. (1.28) is correct by multiplying both sides by $(A + u v^T)$.


1.2.5 Schur Complement Formula 


The Schur complement, also called inversion by partitioning, relates the blocks of the
inverse of a matrix to the inverse of blocks of the original matrix. Let M be an invertible
matrix which we divide in four blocks as

$$M = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix} \quad \text{and} \quad M^{-1} = Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}, \tag{1.31}$$

where $[M_{11}] = n \times n$, $[M_{12}] = n \times (N-n)$, $[M_{21}] = (N-n) \times n$, $[M_{22}] = (N-n) \times (N-n)$,
and $M_{22}$ is invertible. The integer n can take any value from 1 to N − 1.


² See: P. Denton, S. Parke, T. Tao, X. Zhang, Eigenvectors from Eigenvalues: a survey of a basic identity in linear algebra,
arXiv:1908.03795.

Then the upper left n × n block of Q is given by

$$Q_{11}^{-1} = M_{11} - M_{12} (M_{22})^{-1} M_{21}, \tag{1.32}$$

where the right hand side is called the Schur complement of the block $M_{22}$ of the matrix M.
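
Equation (1.32) can be checked numerically on a random block matrix. As before, the diagonal shift is only an assumption made to keep M and $M_{22}$ safely invertible, and the block size n = 2 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)   # arbitrary seed
N, n = 6, 2                      # total size and size of the upper left block
M = rng.normal(size=(N, N)) + N * np.eye(N)   # shifted to stay well conditioned

# Blocks of M in the notation of Eq. (1.31)
M11, M12 = M[:n, :n], M[:n, n:]
M21, M22 = M[n:, :n], M[n:, n:]

Q = np.linalg.inv(M)
Q11 = Q[:n, :n]

# Schur complement of the block M22, Eq. (1.32)
schur = M11 - M12 @ np.linalg.inv(M22) @ M21

print(np.allclose(np.linalg.inv(Q11), schur))
```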


Exercise 1.2.3 Combining Schur and Sherman–Morrison
In the notation of Eq. (1.31) for n = 1 and any N > 1, combine the Schur
complement of the lower right block with the Sherman–Morrison formula to
show that

$$Q_{22} = (M_{22})^{-1} + \frac{(M_{22})^{-1} M_{21} M_{12} (M_{22})^{-1}}{M_{11} - M_{12} (M_{22})^{-1} M_{21}}. \tag{1.33}$$


1.2.6 Function of a Matrix and Matrix Derivative 


In our study of random matrices, we will need to extend real or complex scalar functions
to take a symmetric matrix M as their argument. The simplest way to extend such a function
is to apply it to each eigenvalue of the matrix $M = O \Lambda O^T$:

$$F(M) = O F(\Lambda) O^T, \tag{1.34}$$

where F(Λ) is the diagonal matrix where we have applied the function F to each (diagonal)
entry of Λ. The function F(M) is now a matrix-valued function of a matrix. Scalar
polynomial functions can obviously be extended directly as

$$F(x) = \sum_{k=0}^{K} a_k x^k \;\Rightarrow\; F(M) = \sum_{k=0}^{K} a_k M^k, \tag{1.35}$$

but this is equivalent to applying the polynomial to the eigenvalues of M. By extension,
when the Taylor series of the function F(x) converges for every eigenvalue of M the matrix
Taylor series coincides with our definition.

Taking the trace of F(M) will yield a matrix function that returns a scalar. This construction
is rotationally invariant in the following sense:

$$\operatorname{Tr} F(U M U^T) = \operatorname{Tr} F(M) \quad \text{for any } U U^T = \mathbf{1}. \tag{1.36}$$

We can take the derivative of a scalar-valued function Tr F(M) with respect to each
element of the matrix M:

$$\frac{d}{d[M]_{ij}} \operatorname{Tr}(F(M)) = [F'(M)]_{ij} \;\Rightarrow\; \frac{d}{dM} \operatorname{Tr}(F(M)) = F'(M). \tag{1.37}$$

Equation (1.37) is easy to derive when F(x) is a monomial $a_k x^k$ and by linearity for
polynomial or Taylor series F(x).
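
The definitions of this section can be illustrated with a simple polynomial. In the sketch below the choice $F(x) = x^3 - 2x$, the helper names `matrix_function` and `trace_F`, and the finite-difference step are assumptions made for illustration; the derivative is estimated by perturbing each element of M independently.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed
N = 4
H = rng.normal(size=(N, N))
M = (H + H.T) / 2

def F(x):
    # Scalar polynomial used for illustration
    return x**3 - 2 * x

def F_prime(x):
    return 3 * x**2 - 2

def matrix_function(f, M):
    # Eq. (1.34): apply f to the eigenvalues of a symmetric matrix
    lam, O = np.linalg.eigh(M)
    return O @ np.diag(f(lam)) @ O.T

F_spectral = matrix_function(F, M)
F_direct = M @ M @ M - 2 * M                      # Eq. (1.35)

def trace_F(X):
    # Tr F(X) written with matrix products, valid for any square X
    return np.trace(X @ X @ X - 2 * X)

# Rotational invariance, Eq. (1.36), with a random orthogonal matrix from a QR step
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
rotation_invariant = np.isclose(trace_F(Q @ M @ Q.T), trace_F(M))

# Central finite differences of Tr F(M) with respect to each element M_ij
eps = 1e-5
grad = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        dM = np.zeros((N, N))
        dM[i, j] = eps
        grad[i, j] = (trace_F(M + dM) - trace_F(M - dM)) / (2 * eps)

print(np.allclose(F_spectral, F_direct),
      rotation_invariant,
      np.allclose(grad, matrix_function(F_prime, M), atol=1e-6))
```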


1.2.7 Jacobian of Simple Matrix Transformations


Suppose one transforms an N × N matrix A into another N × N matrix B through some
function of the matrix elements. The Jacobian of the transformation is defined as the
determinant of the matrix of partial derivatives:

$$G_{ij,k\ell} = \frac{\partial B_{k\ell}}{\partial A_{ij}}. \tag{1.38}$$

The simplest case is just multiplication by a scalar: B = αA, leading to $G_{ij,k\ell} = \alpha\, \delta_{ik} \delta_{j\ell}$.
G is therefore the tensor product of $\alpha \mathbf{1}$ with $\mathbf{1}$, and its determinant is thus equal to $\alpha^{N^2}$.
Not much more difficult is the case of an orthogonal transformation $B = O A O^T$, for which
$G_{ij,k\ell} = O_{ik} O_{j\ell}$. G is now the tensor product G = O ⊗ O and therefore its determinant is
unity.

Slightly more complicated is the case where $B = A^{-1}$. Using simple algebra, one readily
obtains, for symmetric matrices,

$$G_{ij,k\ell} = \frac{1}{2} [A^{-1}]_{ik} [A^{-1}]_{j\ell} + \frac{1}{2} [A^{-1}]_{i\ell} [A^{-1}]_{jk} \tag{1.39}$$

(up to an overall minus sign, which does not affect the absolute value of the determinant).
Let us now assume that A has eigenvalues $\lambda_\alpha$ and eigenvectors $v_\alpha$. One can easily diagonalize
$G_{ij,k\ell}$ within the symmetric sector, since

$$\sum_{k\ell} G_{ij,k\ell} \left[ v_{\alpha,k} v_{\beta,\ell} + v_{\alpha,\ell} v_{\beta,k} \right] = \frac{1}{\lambda_\alpha \lambda_\beta} \left[ v_{\alpha,i} v_{\beta,j} + v_{\alpha,j} v_{\beta,i} \right]. \tag{1.40}$$

So the determinant of G is simply $\prod_{\alpha} \prod_{\beta \geq \alpha} (\lambda_\alpha \lambda_\beta)^{-1}$. Taking the logarithm of this product
helps to avoid counting mistakes, and finally leads to the result

$$\det G = (\det A)^{-N-1}. \tag{1.41}$$
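
The result (1.41) can be checked numerically in the symmetric sector. The sketch below makes one assumption that is not spelled out in the text: the N(N + 1)/2 independent elements of a symmetric matrix are taken to be the entries $A_{ij}$ with i ≤ j, and the derivative of the inverse in a direction E is computed from the standard relation $d(A^{-1}) = -A^{-1}\, dA\, A^{-1}$. Only the absolute value of the determinant is compared.

```python
import numpy as np

rng = np.random.default_rng(11)  # arbitrary seed
N = 3
H = rng.normal(size=(N, N))
A = (H + H.T) / 2 + N * np.eye(N)      # symmetric and safely invertible
B = np.linalg.inv(A)

# Independent elements of a symmetric matrix: pairs (i, j) with i <= j
pairs = [(i, j) for i in range(N) for j in range(i, N)]
D = len(pairs)                          # N (N + 1) / 2

G = np.zeros((D, D))
for col, (i, j) in enumerate(pairs):
    # Symmetric elementary perturbation of A_ij (and of A_ji when i != j)
    E = np.zeros((N, N))
    E[i, j] = 1.0
    E[j, i] = 1.0
    dB = -B @ E @ B                     # derivative of A^{-1} in the direction E
    for row, (k, l) in enumerate(pairs):
        G[row, col] = dB[k, l]

jacobian = abs(np.linalg.det(G))
expected = abs(np.linalg.det(A)) ** (-(N + 1))

print("|det G|:", jacobian, " (det A)^(-N-1):", expected)
```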


Exercise 1.2.4 Random Matrices 
We conclude this chapter on deterministic matrices with a numerical exercise 
on random matrices. Most of the results of this exercise will be explored 
theoretically in the following chapters. 


(a) Let $\mathbf{M}$ be a random real symmetric orthogonal matrix, that is an $N \times N$ matrix
satisfying $\mathbf{M} = \mathbf{M}^T = \mathbf{M}^{-1}$. Show that all the eigenvalues of $\mathbf{M}$ are $\pm 1$.

(b) Let $\mathbf{X}$ be a Wigner matrix, i.e. an $N \times N$ real symmetric matrix whose diagonal
and upper triangular entries are IID Gaussian random numbers with zero mean
and variance $\sigma^2/N$. You can use $\mathbf{X} = \sigma(\mathbf{H} + \mathbf{H}^T)/\sqrt{2N}$ where $\mathbf{H}$ is a
non-symmetric $N \times N$ matrix with IID standard Gaussians.

(c) The matrix $\mathbf{P}_+$ is defined as $\mathbf{P}_+ = \frac{1}{2}(\mathbf{M} + \mathbf{1}_N)$. Convince yourself that $\mathbf{P}_+$ is
the projector onto the eigenspace of $\mathbf{M}$ with eigenvalue $+1$. Explain the effect
of the matrix $\mathbf{P}_+$ on eigenvectors of $\mathbf{M}$.

(d) An easy way to generate a random matrix $\mathbf{M}$ is to generate a Wigner matrix
(independent of $\mathbf{X}$), diagonalize it, replace every eigenvalue by its sign and
reconstruct the matrix. The procedure does not depend on the $\sigma$ used for the
Wigner.

(e) We consider a matrix $\mathbf{E}$ of the form $\mathbf{E} = \mathbf{M} + \mathbf{X}$. To wit, $\mathbf{E}$ is a noisy version of
$\mathbf{M}$. The goal of the following is to understand numerically how the matrix $\mathbf{E}$ is
corrupted by the Wigner noise. Using the computer language of your choice,
for a large value of $N$ (as large as possible while keeping computing times
below one minute), for three interesting values of $\sigma$ of your choice, do the
following numerical analysis.

(f) Plot a histogram of the eigenvalues of $\mathbf{E}$, for a single sample first, and then for
many samples (say 100).

(g) From your numerical analysis, in the large $N$ limit, for what values of $\sigma$ do
you expect a non-zero density of eigenvalues near zero?

(h) For every normalized eigenvector $\mathbf{v}_i$ of $\mathbf{E}$, compute the norm of the vector
$\mathbf{P}_+\mathbf{v}_i$. For a single sample, do a scatter plot of $|\mathbf{P}_+\mathbf{v}_i|^2$ vs $\lambda_i$ (its eigenvalue).
Turn your scatter plot into an approximate conditional expectation value
(using a histogram) including data from many samples.

(i) Build an estimator $\Xi(\mathbf{E})$ of $\mathbf{M}$ using only data from $\mathbf{E}$. We want to minimize
the error $\mathcal{E} = \frac{1}{N}\|\Xi(\mathbf{E}) - \mathbf{M}\|_F^2$ where $\|\mathbf{A}\|_F^2 = \operatorname{Tr}\mathbf{A}\mathbf{A}^T$. Consider first
$\Xi_1(\mathbf{E}) = \mathbf{E}$ and then $\Xi_0(\mathbf{E}) = 0$. What is the error $\mathcal{E}$ of these two estimators?
Try to build an ad-hoc estimator $\Xi(\mathbf{E})$ that has a lower error $\mathcal{E}$ than these two.

(j) Show numerically that the eigenvalues of $\mathbf{E}$ are not IID. For each sample $\mathbf{E}$
rank its eigenvalues $\lambda_1 < \lambda_2 < \cdots < \lambda_N$. Consider the eigenvalue spacing
$s_k = \lambda_k - \lambda_{k-1}$ for eigenvalues in the bulk ($0.2N < k < 0.3N$ and $0.7N < k < 0.8N$).
Make a histogram of $\{s_k\}$ including data from 100 samples. Make 100
pseudo-IID samples: mix eigenvalues for 100 different samples and randomly
choose $N$ from the $100N$ possibilities, do not choose the same eigenvalue
twice for a given pseudo-IID sample. For each pseudo-IID sample, compute
$s_k$ in the bulk and make a histogram of the values using data from all 100
pseudo-IID samples. (Bonus) Try to fit an exponential distribution to these two
histograms. The IID case should be well fitted by the exponential but not the
original data (not IID).
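As a starting point for the numerical part of this exercise, the construction of $\mathbf{M}$, $\mathbf{X}$ and $\mathbf{E}$ can be sketched as follows. This is only a minimal illustration in Python with NumPy, not a solution: the function names, the seed and the values $N = 200$, $\sigma = 0.5$ are arbitrary choices made here, and the Wigner matrix uses the construction $\sigma(\mathbf{H} + \mathbf{H}^T)/\sqrt{2N}$ suggested in the exercise.

```python
import numpy as np

def wigner(n, sigma, rng):
    """Symmetric Gaussian matrix sigma * (H + H^T) / sqrt(2n), H with IID standard Gaussians."""
    h = rng.standard_normal((n, n))
    return sigma * (h + h.T) / np.sqrt(2 * n)

def random_symmetric_orthogonal(n, rng):
    """Diagonalize an auxiliary Wigner matrix and replace each eigenvalue by its sign."""
    vals, vecs = np.linalg.eigh(wigner(n, 1.0, rng))
    return (vecs * np.sign(vals)) @ vecs.T

def noisy_sample(n, sigma, rng):
    """Return M together with the eigenvalues and eigenvectors of E = M + X."""
    m = random_symmetric_orthogonal(n, rng)
    e = m + wigner(n, sigma, rng)
    lam, v = np.linalg.eigh(e)
    return m, lam, v

def overlaps_with_plus_space(m, v):
    """Squared norms |P_+ v_i|^2 for each eigenvector v_i (the columns of v)."""
    p_plus = 0.5 * (m + np.eye(m.shape[0]))
    return np.sum((p_plus @ v) ** 2, axis=0)

rng = np.random.default_rng(0)          # seed chosen arbitrarily
m, lam, v = noisy_sample(200, 0.5, rng)
w = overlaps_with_plus_space(m, v)
# lam can be passed to a histogram routine, and (lam, w) to a scatter plot.
```

Repeating `noisy_sample` over many seeds and pooling the eigenvalues gives the multi-sample histograms asked for above.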
Wigner Ensemble and Semi-Circle Law


In many circumstances, the matrices that are encountered are large, and with no particular 
structure. Physicist Eugene Wigner postulated that one can often replace a large complex 
(but deterministic) matrix by a typical element of a certain ensemble of random matrices. 
This bold proposal was made in the context of the study of large complex atomic nuclei, 
where the “matrix” is the Hamiltonian of the system, which is a Hermitian matrix describ- 
ing all the interactions between the neutrons and protons contained in the nucleus. At the 
time, these interactions were not well known; but even if they had been, the task of diag- 
onalizing the Hamiltonian to find the energy levels of the nucleus was so formidable that 
Wigner looked for an alternative. He suggested that we should abandon the idea of finding 
precisely all energy levels, but rather rephrase the question as a statistical question: what 
is the probability to find an energy level within a certain interval, what is the probability 
that the distance between two successive levels is equal to a certain value, etc.? The idea 
of Wigner was that the answer to these questions could be, to some degree, universal, 
i.e. independent of the specific Hermitian matrix describing the system, provided it was 
complex enough. If this is the case, why not replace the Hamiltonian of the system by a 
purely random matrix with the correct symmetry properties? In the case of time-reversal 
invariant quantum systems, the Hamiltonian is a real symmetric matrix (of infinite size). 
In the presence of a magnetic field, the Hamiltonian is a complex, Hermitian matrix (see 
Section 3.1.1). In the presence of “spin–orbit coupling”, the Hamiltonian is symplectic (see
Section 3.1.2). 

This idea has been incredibly fruitful and has led to the development of a subfield 
of mathematical physics called “random matrix theory”. In this book we will study the 
properties of some ensembles of random matrices. We will mostly focus on symmetric 
matrices with real entries as those are the most commonly encountered in data analy- 
sis and statistical physics. For example, Wigner’s idea has been transposed to glasses 
and spin-glasses, where the interaction between pairs of atoms or pairs of spins is often 
replaced by a real symmetric, random matrix (see Section 13.4). In other cases, the ran- 
domness stems from noisy observations. For example, when one wants to measure the 
covariance matrix of the returns of a large number of assets using a sample of finite length 
(for example the 500 stocks of the S&P500 using 4 years of daily data, i.e. 4 x 250 = 
1000 data points per stock), there is inevitably some measurement noise that pollutes the 
determination of said covariance matrix. We will be confronted with this precise problem in
Chapters 4 and 17. 

In the present chapter and the following one, we will investigate the simplest of all 
ensembles of random matrices, which was proposed by Wigner himself in the context 
recalled above. These are matrices where all elements are Gaussian random variables, with 
the only constraint that the matrix is real symmetric (the Gaussian orthogonal ensemble, 
GOE), complex Hermitian (the Gaussian unitary ensemble, GUE) or symplectic (the Gaus- 
sian symplectic ensemble, GSE). 


2.1 Normalized Trace and Sample Averages 


We first generalize the notion of expectation value and moments from classical probabilities
to large random matrices. We could simply consider the moments $\mathbb{E}[\mathbf{A}^k]$ but that object is
very large ($N \times N$ dimensional). It is not clear how to interpret it as $N \to \infty$. It turns
out that the correct analog of the expectation value is the normalized trace operator $\tau(\cdot)$,
defined as

$$\tau(\mathbf{A}) := \frac{1}{N}\,\mathbb{E}[\operatorname{Tr}\mathbf{A}]. \tag{2.1}$$

The normalization by $1/N$ is there to make the normalized trace operator finite as $N \to \infty$.
For example for the identity matrix, $\tau(\mathbf{1}) = 1$ independently of the dimension and our
definition therefore makes sense as $N \to \infty$. When using the notation $\tau(\mathbf{A})$ we will only
consider the dominant term as $N \to \infty$, implicitly taking the large $N$ limit.

For a polynomial function of a matrix $F(\mathbf{A})$ or by extension for a function that can be
written as a power series, the trace of the function can be computed on the eigenvalues:

$$\frac{1}{N}\operatorname{Tr} F(\mathbf{A}) = \frac{1}{N}\sum_{k=1}^{N} F(\lambda_k). \tag{2.2}$$

In the following, we will denote as $\langle\cdot\rangle$ the average over the eigenvalues of a single matrix
$\mathbf{A}$ (sample), i.e.

$$\langle F(\lambda)\rangle := \frac{1}{N}\sum_{k=1}^{N} F(\lambda_k). \tag{2.3}$$


For large random matrices, many scalar quantities such as $\tau(F(\mathbf{A}))$ do not fluctuate from
sample to sample, or more precisely such fluctuations go to zero in the large $N$ limit.
Physicists speak of this phenomenon as self-averaging and mathematicians speak of
concentration of measure:

$$\tau(F(\mathbf{A})) = \frac{1}{N}\,\mathbb{E}[\operatorname{Tr} F(\mathbf{A})] \approx \langle F(\lambda)\rangle \quad \text{for a single } \mathbf{A}. \tag{2.4}$$

When the eigenvalues of a random matrix $\mathbf{A}$ converge to a well-defined density $\rho(\lambda)$, we
can write

$$\tau(F(\mathbf{A})) = \int \rho(\lambda)F(\lambda)\,\mathrm{d}\lambda. \tag{2.5}$$


Using $F(\mathbf{A}) = \mathbf{A}^k$, we can define the $k$th moment of a random matrix by $m_k := \tau(\mathbf{A}^k)$.
The first moment $m_1$ is simply the normalized trace of $\mathbf{A}$, while $m_2 = \frac{1}{N}\sum_{ij} A_{ij}^2$ is
the normalized sum of the squares of all the elements. The square-root of $m_2$ satisfies the
axioms of a norm and is called the Frobenius norm of $\mathbf{A}$:

$$\|\mathbf{A}\|_F := \sqrt{m_2}. \tag{2.6}$$
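The single-sample versions of these quantities are easy to evaluate numerically. The short sketch below (Python with NumPy; the helper names and the $2 \times 2$ test matrix are illustrative choices, not part of the text) computes $\frac{1}{N}\operatorname{Tr}\mathbf{A}^k$ from the eigenvalues as in Eq. (2.2) and checks that $m_2$ coincides with the normalized sum of squared entries. It evaluates the average $\langle\cdot\rangle$ over one matrix, without the expectation contained in $\tau$.

```python
import numpy as np

def sample_moment(a, k):
    """(1/N) Tr A^k for one symmetric matrix, evaluated on its eigenvalues."""
    lam = np.linalg.eigvalsh(a)
    return np.mean(lam ** k)

def normalized_frobenius(a):
    """sqrt(m_2), with m_2 computed directly from the matrix elements."""
    n = a.shape[0]
    return np.sqrt(np.sum(a ** 2) / n)

a = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # eigenvalues 1 and 3
m1 = sample_moment(a, 1)             # normalized trace
m2 = sample_moment(a, 2)             # equals (1/N) * sum_ij A_ij^2
```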


2.2 The Wigner Ensemble 
2.2.1 Moments of Wigner Matrices 


We will define a Wigner matrix $\mathbf{X}$ as a symmetric matrix ($\mathbf{X} = \mathbf{X}^T$) with Gaussian entries
with zero mean. In a symmetric matrix there are really two types of elements: diagonal and
off-diagonal, which can have different variances. Diagonal elements have variance $\sigma_{\mathrm{d}}^2$ and
off-diagonal elements have variance $\sigma_{\mathrm{od}}^2$. Note that $X_{ij} = X_{ji}$ so they are not independent
variables.

In fact, the elements in a Wigner matrix do not need to be Gaussian or even to be IID, 
as there are many weaker (more general) definitions of the Wigner matrix that yield the 
same final statistical results in the limit of large matrices $N \to \infty$. For the purpose of this
introductory book we will stick to the strong Gaussian hypothesis. 


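The first few moments of our Wigner matrix $\mathbf{X}$ are given by

$$\tau(\mathbf{X}) = \frac{1}{N}\,\mathbb{E}[\operatorname{Tr}\mathbf{X}] = \frac{1}{N}\operatorname{Tr}\mathbb{E}[\mathbf{X}] = 0, \tag{2.7}$$

$$\tau(\mathbf{X}^2) = \frac{1}{N}\,\mathbb{E}[\operatorname{Tr}\mathbf{X}\mathbf{X}^T] = \frac{1}{N}\,\mathbb{E}\Bigl[\sum_{i,j=1}^{N} X_{ij}^2\Bigr] = \frac{1}{N}\bigl[N(N-1)\sigma_{\mathrm{od}}^2 + N\sigma_{\mathrm{d}}^2\bigr]. \tag{2.8}$$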


The term containing $\sigma_{\mathrm{od}}^2$ dominates when the two variances are of the same order of
magnitude. So for a Wigner matrix we can pick any variance we want on the diagonal (as long
as it is small with respect to $N\sigma_{\mathrm{od}}^2$). We want to normalize our Wigner matrix so that its
second moment is independent of the size of the matrix ($N$). Let us pick

$$\sigma_{\mathrm{od}}^2 = \sigma^2/N. \tag{2.9}$$

For $\sigma_{\mathrm{d}}^2$ the natural choice seems to be $\sigma_{\mathrm{d}}^2 = \sigma^2/N$. However, we will rather choose
$\sigma_{\mathrm{d}}^2 = 2\sigma^2/N$, which is easy to generate numerically and more importantly respects rotational
invariance for finite $N$, as we show in the next subsection. The ensemble described here
(with the choice $\sigma_{\mathrm{d}}^2 = 2\sigma_{\mathrm{od}}^2$) is called the Gaussian orthogonal ensemble or GOE.¹


¹ Some authors define a GOE matrix to have $\sigma^2 = 1$, others $\sigma^2 = N$. For us a GOE matrix can have any variance and is thus
synonymous with the Gaussian rotationally invariant Wigner matrix.
To generate a GOE matrix numerically, first generate a non-symmetric random square
matrix $\mathbf{H}$ of size $N$ with IID $\mathcal{N}(0, \sigma^2/(2N))$ coefficients. Then let the Wigner matrix $\mathbf{X}$
be $\mathbf{X} = \mathbf{H} + \mathbf{H}^T$. The matrix $\mathbf{X}$ will then be symmetric with diagonal variance twice the
off-diagonal variance. The reason is that off-diagonal terms are sums of two independent
Gaussian variables, so the variance is doubled. Diagonal elements, on the other hand, are
equal to twice the original variables $H_{ii}$ and so their variance is multiplied by 4.
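The recipe above translates directly into a few lines of code. The following is a minimal sketch in Python with NumPy; the function name `goe`, the seed and the size $N = 400$ are choices made for this illustration. It also estimates the diagonal and off-diagonal variances from a single sample, which should come out close to $2\sigma^2/N$ and $\sigma^2/N$ respectively.

```python
import numpy as np

def goe(n, sigma, rng):
    """X = H + H^T with IID entries H_ij ~ N(0, sigma^2 / (2n))."""
    h = rng.normal(0.0, sigma / np.sqrt(2 * n), size=(n, n))
    return h + h.T

rng = np.random.default_rng(1)   # arbitrary seed
n, sigma = 400, 1.0
x = goe(n, sigma, rng)

# Empirical variances (the entries have zero mean, so the mean square is used)
diag_var = np.mean(np.diag(x) ** 2)                      # near 2 sigma^2 / n
offdiag_var = np.mean(x[np.triu_indices(n, k=1)] ** 2)   # near sigma^2 / n
```

The diagonal estimate is noisier than the off-diagonal one, since it uses only $N$ entries instead of $N(N-1)/2$.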

With any choice of $\sigma_{\mathrm{d}}^2$ we have

$$\tau(\mathbf{X}^2) = \sigma^2 + O(1/N), \tag{2.10}$$

and hence we will call the parameter $\sigma^2$ the variance of the Wigner matrix.
The third moment $\tau(\mathbf{X}^3) = 0$ from the fact that the Gaussian distribution is even. Later
we will show that

$$\tau(\mathbf{X}^4) = 2\sigma^4. \tag{2.11}$$

For a Gaussian variable of variance $\sigma^2$ we have $\mathbb{E}[x^4] = 3\sigma^4$; this implies that the eigenvalue
density of a Wigner matrix is not Gaussian. What is this eigenvalue distribution? As we will show many
times over in this book, it is given by the semi-circle law, originally derived by Wigner
himself:

$$\rho(\lambda) = \frac{\sqrt{4\sigma^2 - \lambda^2}}{2\pi\sigma^2} \quad \text{for } -2\sigma < \lambda < 2\sigma. \tag{2.12}$$
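The moments quoted in Eqs. (2.10) and (2.11) can be checked against both a simulated matrix and the semi-circle density (2.12). The sketch below is one possible check in Python with NumPy, assuming $\sigma = 1$, $N = 1000$ and an arbitrary seed; the helper names `goe` and `semicircle` are introduced here for illustration only.

```python
import numpy as np

def goe(n, sigma, rng):
    """X = H + H^T with IID entries H_ij ~ N(0, sigma^2 / (2n))."""
    h = rng.normal(0.0, sigma / np.sqrt(2 * n), size=(n, n))
    return h + h.T

def semicircle(lam, sigma):
    """Semi-circle density; zero outside [-2 sigma, 2 sigma]."""
    inside = np.clip(4 * sigma**2 - lam**2, 0.0, None)
    return np.sqrt(inside) / (2 * np.pi * sigma**2)

rng = np.random.default_rng(2)   # arbitrary seed
n, sigma = 1000, 1.0
lam = np.linalg.eigvalsh(goe(n, sigma, rng))
tau2, tau4 = np.mean(lam**2), np.mean(lam**4)   # single-sample moments

# Same moments from the semi-circle law, using a midpoint rule
edges = np.linspace(-2 * sigma, 2 * sigma, 20001)
mid = 0.5 * (edges[1:] + edges[:-1])
dx = edges[1] - edges[0]
sc2 = np.sum(mid**2 * semicircle(mid, sigma)) * dx
sc4 = np.sum(mid**4 * semicircle(mid, sigma)) * dx
```

Both pairs should be close to $\sigma^2 = 1$ and $2\sigma^4 = 2$, up to $O(1/N)$ corrections for the simulated matrix and discretization error for the integral.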


2.2.2 Rotational Invariance 


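We remind the reader that to rotate a vector $\mathbf{v}$, one applies a rotation matrix $\mathbf{O}$: $\mathbf{w} = \mathbf{O}\mathbf{v}$
where $\mathbf{O}$ is an orthogonal matrix $\mathbf{O}^T = \mathbf{O}^{-1}$ (i.e. $\mathbf{O}\mathbf{O}^T = \mathbf{1}$). Note that in general $\mathbf{O}$ is not
symmetric. To rotate the basis in which a matrix is written, one writes $\tilde{\mathbf{X}} = \mathbf{O}\mathbf{X}\mathbf{O}^T$. The
eigenvalues of $\tilde{\mathbf{X}}$ are the same as those of $\mathbf{X}$. The eigenvectors are $\{\mathbf{O}\mathbf{v}\}$ where $\{\mathbf{v}\}$ are the
eigenvectors of $\mathbf{X}$.

A rotationally invariant random matrix ensemble is such that the matrix $\mathbf{O}\mathbf{X}\mathbf{O}^T$ is as
probable as the matrix $\mathbf{X}$ itself, i.e. $\mathbf{O}\mathbf{X}\mathbf{O}^T \overset{\text{in law}}{=} \mathbf{X}$.

Let us show that the construction $\mathbf{X} = \mathbf{H} + \mathbf{H}^T$ with a Gaussian IID matrix $\mathbf{H}$ leads to
a rotationally invariant ensemble. First, note an important property of Gaussian variables,
namely that a Gaussian IID vector $\mathbf{v}$ (a white multivariate Gaussian vector) is rotationally
invariant. The reason is that $\mathbf{w} = \mathbf{O}\mathbf{v}$ is again a Gaussian vector (since sums of Gaussians
are still Gaussian), with covariance given by

$$\mathbb{E}[w_i w_j] = \sum_{k\ell} O_{ik}O_{j\ell}\,\mathbb{E}[v_k v_\ell] = \sum_{k\ell} O_{ik}O_{j\ell}\,\delta_{k\ell} = [\mathbf{O}\mathbf{O}^T]_{ij} = \delta_{ij}. \tag{2.13}$$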


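Now, write

$$\mathbf{X} = \mathbf{H} + \mathbf{H}^T, \tag{2.14}$$

where $\mathbf{H}$ is a square matrix filled with IID Gaussian random numbers. Each column of $\mathbf{H}$
is rotationally invariant: $\mathbf{O}\mathbf{H} \overset{\text{in law}}{=} \mathbf{H}$, and the matrix $\mathbf{O}\mathbf{H}$ is row-wise rotationally invariant:
$\mathbf{O}\mathbf{H}\mathbf{O}^T \overset{\text{in law}}{=} \mathbf{O}\mathbf{H}$. So $\mathbf{H}$ is rotationally invariant as a matrix. Now

$$\mathbf{O}\mathbf{X}\mathbf{O}^T = \mathbf{O}(\mathbf{H} + \mathbf{H}^T)\mathbf{O}^T \overset{\text{in law}}{=} \mathbf{H} + \mathbf{H}^T = \mathbf{X}, \tag{2.15}$$

which shows that the Wigner ensemble with $\sigma_{\mathrm{d}}^2 = 2\sigma_{\mathrm{od}}^2$ is rotationally invariant for any
matrix size $N$. More general definitions of the Wigner ensemble (including non-Gaussian
ensembles) are only asymptotically rotationally invariant (i.e. when $N \to \infty$).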

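Another way to see the rotational invariance of the Wigner ensemble is to look at the
joint law of matrix elements:

$$P(\{X_{ij}\}) = \left(\frac{1}{2\pi\sigma_{\mathrm{d}}^2}\right)^{N/2}\left(\frac{1}{2\pi\sigma_{\mathrm{od}}^2}\right)^{N(N-1)/4}\exp\left\{-\sum_{i=1}^{N}\frac{X_{ii}^2}{2\sigma_{\mathrm{d}}^2} - \sum_{i<j}\frac{X_{ij}^2}{2\sigma_{\mathrm{od}}^2}\right\}, \tag{2.16}$$

where only the diagonal and upper triangular elements are independent variables. With the
choice $\sigma_{\mathrm{od}}^2 = \sigma^2/N$ and $\sigma_{\mathrm{d}}^2 = 2\sigma^2/N$ this becomes

$$P(\{X_{ij}\}) \propto \exp\left\{-\frac{N}{4\sigma^2}\operatorname{Tr}\mathbf{X}^2\right\}. \tag{2.17}$$

Under the change of variable $\mathbf{X} \to \tilde{\mathbf{X}} = \mathbf{O}\mathbf{X}\mathbf{O}^T$ the argument of the exponential is invariant
because the trace of a matrix is independent of the basis, and the Jacobian of the
transformation is equal to 1 (see Section 1.2.7); therefore $\mathbf{X} \overset{\text{in law}}{=} \mathbf{O}\mathbf{X}\mathbf{O}^T$.

By the same argument any matrix whose joint probability density of its elements can
be written as $P(\{M_{ij}\}) \propto \exp\{-N\operatorname{Tr}V(\mathbf{M})\}$, where $V(\cdot)$ is an arbitrary function, will be
rotationally invariant. We will study such matrix ensembles in Chapter 5.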
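The two deterministic ingredients of this argument, namely that a rotation leaves the spectrum and $\operatorname{Tr}\mathbf{X}^2$ unchanged, can be verified on one sample. The sketch below (Python with NumPy) is only such a sanity check and does not test equality in law; drawing the orthogonal matrix from the QR decomposition of a Gaussian matrix is a convenience assumed here, and the seed and size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
n = 50
h = rng.normal(0.0, 1.0 / np.sqrt(2 * n), size=(n, n))
x = h + h.T                                   # GOE sample with sigma = 1

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix
o, _ = np.linalg.qr(rng.standard_normal((n, n)))
x_rot = o @ x @ o.T

eig_x = np.linalg.eigvalsh(x)
eig_rot = np.linalg.eigvalsh(x_rot)
weight_x = np.trace(x @ x)              # Tr X^2, the quantity entering the exponential weight
weight_rot = np.trace(x_rot @ x_rot)
```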


2.3 Resolvent and Stieltjes Transform 
2.3.1 Definition and Basic Properties 


In this section we introduce the Stieltjes transform of a matrix. It will give us information 
about all the moments of the random matrix and also about the density of its eigenvalues in 
the large N limit. First we need to define the matrix resolvent. 

Given an N x N real symmetric matrix A, its resolvent is given by 


$$\mathbf{G}_{\mathbf{A}}(z) = (z\mathbf{1} - \mathbf{A})^{-1}, \tag{2.18}$$

where $z$ is a complex variable defined away from all the (real) eigenvalues of $\mathbf{A}$ and $\mathbf{1}$
denotes the identity matrix. Then the Stieltjes transform of $\mathbf{A}$ is given by²

$$g_N^{\mathbf{A}}(z) = \frac{1}{N}\operatorname{Tr}\bigl(\mathbf{G}_{\mathbf{A}}(z)\bigr) = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{z - \lambda_k}, \tag{2.19}$$

where $\lambda_k$ are the eigenvalues of $\mathbf{A}$. The subscript $N$ indicates that this is the finite $N$ Stieltjes
transform of a single realization of $\mathbf{A}$. When it is clear from context which matrix we
consider we will drop the superscript $\mathbf{A}$ and write $g_N(z)$.

² In mathematical literature, the Stieltjes transform is more commonly defined as $s_{\mathbf{A}}(z) = -(1/N)\operatorname{Tr}\mathbf{G}_{\mathbf{A}}(z)$, i.e. with an extra
minus sign. Some authors prefer the name Cauchy transform.

Let us see why the Stieltjes transform gives useful information about the density of
eigenvalues of $\mathbf{A}$. For a given random matrix $\mathbf{A}$, we can define the empirical spectral
distribution (ESD), also called the sample eigenvalue density:

$$\rho_N(\lambda) = \frac{1}{N}\sum_{k=1}^{N}\delta(\lambda - \lambda_k), \tag{2.20}$$

where $\delta(x)$ is the Dirac delta function. Then the Stieltjes transform can be written as

$$g_N(z) = \int_{-\infty}^{+\infty}\frac{\rho_N(\lambda)}{z - \lambda}\,\mathrm{d}\lambda. \tag{2.21}$$

Note that $g_N(z)$ is well defined for any $z \notin \{\lambda_k : 1 \leq k \leq N\}$. In particular, it is well
behaved at $\infty$:

$$g_N(z) = \sum_{k=0}^{\infty}\frac{1}{z^{k+1}}\,\frac{1}{N}\operatorname{Tr}(\mathbf{A}^k), \qquad \frac{1}{N}\operatorname{Tr}(\mathbf{A}^0) = 1. \tag{2.22}$$


We will consider random matrices $\mathbf{A}$ such that, for large $N$, the normalized traces of powers
of $\mathbf{A}$ converge to their expectation values, which are deterministic numbers:

$$\lim_{N\to\infty}\frac{1}{N}\operatorname{Tr}(\mathbf{A}^k) = \tau(\mathbf{A}^k). \tag{2.23}$$

We then expect that, for large enough $z$, the function $g_N(z)$ converges to a deterministic
limit $g(z)$ defined as $g(z) = \lim_{N\to\infty}\mathbb{E}[g_N(z)]$, whose Taylor series is

$$g(z) = \sum_{k=0}^{\infty}\frac{1}{z^{k+1}}\,\tau(\mathbf{A}^k), \tag{2.24}$$

for $z$ away from the real axis.

Thus g(z) is a moment generating function of A. In other words, the knowledge of g(z) 
near infinity is equivalent to the knowledge of all the moments of A. To the level of rigor 
of this book, the knowledge of all the moments of A is equivalent to the knowledge of the 
density of its eigenvalues. For any function $F(x)$ defined over the support of the eigenvalues
$[\lambda_-, \lambda_+]$ of $\mathbf{A}$ we can compute its expectation:

$$\mathbb{E}[F(\lambda)] = \int_{\lambda_-}^{\lambda_+}\rho(\lambda)F(\lambda)\,\mathrm{d}\lambda, \qquad \rho(\lambda) := \mathbb{E}[\rho_N(\lambda)]. \tag{2.25}$$

Alternatively we can approximate the function $F(x)$ arbitrarily well by a polynomial
$Q(x) = a_0 + a_1 x + \cdots + a_K x^K$ and find

$$\tau(F(\mathbf{A})) \approx \tau(Q(\mathbf{A})) = \sum_{k=0}^{K} a_k\,\tau(\mathbf{A}^k). \tag{2.26}$$

To recap, we only need to know $g(z)$ in the neighborhood of $|z| \to \infty$ to know all the
moments of $\mathbf{A}$ and these moments tell us everything about $\rho(\lambda)$. In computing the Stieltjes
transform in concrete cases, we will often make use of that fact and only estimate it for 
very large values of z. 
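The statement that $g(z)$ near infinity encodes the moments can be illustrated on a finite spectrum, where the expansion (2.22) converges for $|z| > \max_k|\lambda_k|$. The sketch below (Python with NumPy) compares the sample Stieltjes transform with its truncated moment series; the three-point spectrum, the value of $z$ and the truncation order are arbitrary choices made for this illustration.

```python
import numpy as np

def stieltjes(lam, z):
    """Sample Stieltjes transform g_N(z) = (1/N) sum_k 1 / (z - lambda_k)."""
    return np.mean(1.0 / (z - lam))

def moment_series(lam, z, order):
    """Truncated expansion sum_{k=0}^{order} m_k / z^(k+1), with m_k = (1/N) sum_j lambda_j^k."""
    return sum(np.mean(lam**k) / z ** (k + 1) for k in range(order + 1))

lam = np.array([-1.0, 0.5, 2.0])   # toy spectrum, chosen for illustration
z = 10.0 + 1.0j                     # |z| well above max |lambda|
exact = stieltjes(lam, z)
approx = moment_series(lam, z, order=12)
```

The truncation error shrinks geometrically, roughly like $(\max_k|\lambda_k|/|z|)^{K+1}$ for a series cut at order $K$.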

The Stieltjes transform also gives the negative moments when they exist. If the eigenvalues
of $\mathbf{A}$ satisfy $\min_k \lambda_k > c$ for some constant $c > 0$, then the inverse moments of $\mathbf{A}$
exist and are given by the expansion of $g(z)$ around $z = 0$:

$$g(z) = -\sum_{k=0}^{\infty} z^k\,\tau(\mathbf{A}^{-k-1}). \tag{2.27}$$

In particular, we have

$$g(0) = -\tau(\mathbf{A}^{-1}). \tag{2.28}$$


Exercise 2.3.1 Stieltjes transform for shifted and scaled matrices 
Let A be a random matrix drawn from a well-behaved ensemble with Stieltjes 
transform $g(z)$. What are the Stieltjes transforms of the random matrices $\alpha\mathbf{A}$ and
$\mathbf{A} + \beta\mathbf{1}$ where $\alpha$ and $\beta$ are non-zero real numbers and $\mathbf{1}$ the identity matrix?


2.3.2 Stieltjes Transform of the Wigner Ensemble 


We are now ready to compute the Stieltjes transform of the Wigner ensemble. The first 
technique we will use is sometimes called the cavity method or the self-consistent equation. 
We will find a relation between the Stieltjes transform of a Wigner matrix of size N and 
one of size N — 1. In the large N limit, the two converge to the same limiting Stieltjes 
transform and give us a self-consistent equation that can be solved easily. 

We would like to calculate $g_N^{\mathbf{X}}(z)$ when $\mathbf{X}$ is a Wigner matrix, with $X_{ij} \sim \mathcal{N}(0, \sigma^2/N)$
and $X_{ii} \sim \mathcal{N}(0, 2\sigma^2/N)$. In the large $N$ limit, we expect that $g_N^{\mathbf{X}}(z)$ converges towards a
well-defined limit $g(z)$.

We can use the Schur complement formula (1.32) to compute the (1, 1) element of the
inverse of $\mathbf{M} = z\mathbf{1} - \mathbf{X}$. Then we have

$$\frac{1}{(\mathbf{G}_{\mathbf{X}})_{11}} = M_{11} - \sum_{k,\ell=2}^{N} M_{1k}\,(\mathbf{M}_{22})^{-1}_{k\ell}\,M_{\ell 1}, \tag{2.29}$$

where the matrix $\mathbf{M}_{22}$ is the $(N-1) \times (N-1)$ submatrix of $\mathbf{M}$ with the first row and column
removed. For large $N$, we argue that the right hand side is dominated by its expectation
value with small ($O(1/\sqrt{N})$) fluctuations. We will only compute its expectation value,
but getting a more precise handle on its fluctuations would not be difficult. First, we note
that $\mathbb{E}[M_{11}] = z$. We then note that the entries of $\mathbf{M}_{22}$ are independent of the ones of
$M_{1i} = -X_{1i}$. Thus we can first take the partial expectation over the $\{X_{1i}\}$, and get

$$\mathbb{E}_{\{X_{1i}\}}\bigl[M_{1i}\,(\mathbf{M}_{22})^{-1}_{ij}\,M_{1j}\bigr] = \frac{\sigma^2}{N}\,(\mathbf{M}_{22})^{-1}_{ii}\,\delta_{ij}, \tag{2.30}$$

so we have

$$\mathbb{E}_{\{X_{1i}\}}\Bigl[\sum_{k,\ell=2}^{N} M_{1k}\,(\mathbf{M}_{22})^{-1}_{k\ell}\,M_{\ell 1}\Bigr] = \frac{\sigma^2}{N}\operatorname{Tr}\bigl((\mathbf{M}_{22})^{-1}\bigr). \tag{2.31}$$


Another observation is that $\frac{1}{N-1}\operatorname{Tr}\bigl((\mathbf{M}_{22})^{-1}\bigr)$ is the Stieltjes transform of a Wigner
matrix of size $N - 1$ and variance $\sigma^2(N-1)/N$. In the large $N$ limit, the Stieltjes transform
should be independent of the matrix size and the difference between $N$ and $(N - 1)$ is
negligible. So we have

$$\frac{1}{N}\operatorname{Tr}\bigl((\mathbf{M}_{22})^{-1}\bigr) \to g(z). \tag{2.32}$$
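Both steps used so far, the exact Schur complement identity (2.29) and the replacement of the quadratic form by $(\sigma^2/N)\operatorname{Tr}((\mathbf{M}_{22})^{-1})$, can be checked on a single sample. The sketch below is a hedged numerical illustration in Python with NumPy; the size $N = 600$, the point $z = 0.5 - 0.5i$ and the seed are arbitrary assumptions, and explicit matrix inverses are used only for readability.

```python
import numpy as np

rng = np.random.default_rng(4)   # arbitrary seed
n, sigma = 600, 1.0
h = rng.normal(0.0, sigma / np.sqrt(2 * n), size=(n, n))
x = h + h.T                       # GOE sample
z = 0.5 - 0.5j                    # a point away from the real axis

m = z * np.eye(n) - x
m22_inv = np.linalg.inv(m[1:, 1:])               # inverse of M with first row/column removed

lhs = 1.0 / np.linalg.inv(m)[0, 0]               # 1 / (G_X)_11
quad = m[0, 1:] @ m22_inv @ m[1:, 0]             # sum_kl M_1k (M_22^-1)_kl M_l1
rhs = m[0, 0] - quad                             # right-hand side of the Schur identity
averaged = sigma**2 / n * np.trace(m22_inv)      # partial expectation of the quadratic form
```

The first comparison (`lhs` against `rhs`) holds to machine precision, while `quad` and `averaged` differ by a fluctuation that decreases as $N$ grows.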


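We therefore have that $1/(\mathbf{G}_{\mathbf{X}})_{11}$ equals a deterministic number with negligible
fluctuations; hence in the large $N$ limit we have

$$\mathbb{E}\left[\frac{1}{(\mathbf{G}_{\mathbf{X}})_{11}}\right] = \frac{1}{\mathbb{E}[(\mathbf{G}_{\mathbf{X}})_{11}]}. \tag{2.33}$$

From the rotational invariance of $\mathbf{X}$ and therefore of $\mathbf{G}_{\mathbf{X}}$, all diagonal entries of $\mathbf{G}_{\mathbf{X}}$ must
have the same expectation value:

$$\mathbb{E}[(\mathbf{G}_{\mathbf{X}})_{11}] = \frac{1}{N}\,\mathbb{E}[\operatorname{Tr}(\mathbf{G}_{\mathbf{X}})] = \mathbb{E}[g_N] \to g. \tag{2.34}$$

Putting all the pieces together, we find that in the large $N$ limit Eq. (2.29) becomes

$$\frac{1}{g(z)} = z - \sigma^2 g(z). \tag{2.35}$$

Solving (2.35) we obtain that

$$\sigma^2 g^2 - zg + 1 = 0 \;\Rightarrow\; g = \frac{z \pm \sqrt{z^2 - 4\sigma^2}}{2\sigma^2}. \tag{2.36}$$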
We know that $g(z)$ should be analytic for large complex $z$ but the square-root above can run
into branch cuts. It is convenient to pull out a factor of $z$ and express the square-root as a
function of $1/z$ which becomes small for large $z$:

$$g(z) = \frac{z \pm z\sqrt{1 - 4\sigma^2/z^2}}{2\sigma^2}. \tag{2.37}$$

We can now choose the correct root: the $+$ sign gives an incorrect $g(z) \sim z/\sigma^2$ for large $z$
while the $-$ sign gives $g(z) \sim 1/z$ for any large complex $z$ as expected, so we have:

$$g(z) = \frac{z - z\sqrt{1 - 4\sigma^2/z^2}}{2\sigma^2}. \tag{2.38}$$
Figure 2.1 The branch cuts of the Wigner Stieltjes transform.


Note, for numerical applications, it is very important to pick the correct branch of the
square-root. The function $g(z)$ is analytic for $|z| > 2\sigma$; the branch cuts of the square-root
must therefore be confined to the interval $[-2\sigma, 2\sigma]$ (Fig. 2.1). We will come back to this
problem of determining the correct branch of the Stieltjes transform in Section 4.2.3.
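In practice the branch issue shows up as soon as the formula is coded. The sketch below (Python with NumPy) evaluates Eq. (2.38) with the principal square root of $1 - 4\sigma^2/z^2$, whose cut lies on $[-2\sigma, 2\sigma]$, and contrasts it with a naive evaluation based on $\sqrt{z^2 - 4\sigma^2}$, which picks the wrong root when $\operatorname{Re} z < 0$. The function names, the seed and the test point are illustrative assumptions; the comparison with a simulated GOE matrix is only a rough check at finite $N$.

```python
import numpy as np

def g_wigner(z, sigma=1.0):
    """Closed form with the principal root of 1 - 4 sigma^2 / z^2 (cut on [-2 sigma, 2 sigma])."""
    z = np.asarray(z, dtype=complex)
    return (z - z * np.sqrt(1.0 - 4.0 * sigma**2 / z**2)) / (2.0 * sigma**2)

def g_naive(z, sigma=1.0):
    """Same expression written with sqrt(z^2 - 4 sigma^2); wrong branch when Re z < 0."""
    z = np.asarray(z, dtype=complex)
    return (z - np.sqrt(z**2 - 4.0 * sigma**2)) / (2.0 * sigma**2)

# Compare with the sample Stieltjes transform of one GOE matrix (sigma = 1)
rng = np.random.default_rng(5)   # arbitrary seed
n = 500
h = rng.normal(0.0, 1.0 / np.sqrt(2 * n), size=(n, n))
lam = np.linalg.eigvalsh(h + h.T)
z = -1.0 - 0.5j
g_sample = np.mean(1.0 / (z - lam))
g_theory = g_wigner(z)
```

For instance, at $z = -3$ the first function returns a value of modulus below one, as it must for a point at distance one from the spectrum, whereas the naive version returns the other root of the quadratic equation.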

It might seem strange that $g(z)$ given by Eq. (2.38) has no poles but only branch cuts.
For finite $N$, the sample Stieltjes transform

$$g_N(z) = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{z - \lambda_k} \tag{2.39}$$

has poles at the eigenvalues of $\mathbf{X}$. As $N \to \infty$, the poles fuse together and

$$\frac{1}{N}\sum_{k=1}^{N}\delta(x - \lambda_k) \sim \rho(x). \tag{2.40}$$

The density $\rho(x)$ can have extended support and/or isolated Dirac masses. Then as $N \to \infty$,
we have

$$g(z) = \int_{\operatorname{supp}\{\rho\}}\frac{\rho(x)\,\mathrm{d}x}{z - x}, \tag{2.41}$$

which is the Stieltjes transform of the limiting measure $\rho(x)$.


2.3.3 Convergence of Stieltjes near the Real Axis 


It is natural to ask the following questions: how does $g_N(z)$ given by Eq. (2.39) converge
to $g(z) = \int \frac{\rho(\lambda)\,\mathrm{d}\lambda}{z - \lambda}$, and how do we recover $\rho(x)$ from $g(z)$?

We have argued before that $g_N(z)$ converges to $g(z)$ for very large complex $z$ such
that the Taylor series around infinity is convergent. The function $g(z)$ is not defined on
the real axis for $z = x$ on the support of $\rho(x)$; nevertheless, immediately below (and
above) the real axis the random function $g_N(z)$ converges to $g(z)$. (The case where $z$ is
right on the real axis is discussed in Section 2.3.6.) Let us study the random function $g_N(z)$
just below the support of $\rho(x)$.

We let $z = x - i\eta$, with $x \in \operatorname{supp}\{\rho\}$ and $\eta$ a small positive number. Then

$$g_N(x - i\eta) = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{x - i\eta - \lambda_k} = \frac{1}{N}\sum_{k=1}^{N}\frac{x - \lambda_k + i\eta}{(x - \lambda_k)^2 + \eta^2}. \tag{2.42}$$


We focus on the imaginary part of $g_N(x - i\eta)$ (the real part is discussed in Section 19.5.2).
Note that it is a convolution of the empirical spectral density $\rho_N(\lambda)$ and $\pi$ times the Cauchy
kernel:

$$\pi K_\eta(x) = \frac{\eta}{x^2 + \eta^2}. \tag{2.43}$$

The Cauchy kernel $K_\eta(x)$ is strongly peaked around zero with a window width of order
$\eta$ (Fig. 2.2). Since there are $N$ eigenvalues lying inside the interval $[\lambda_-, \lambda_+]$, the typical
eigenvalue spacing is of order $(\lambda_+ - \lambda_-)/N = O(N^{-1})$.

(1) Suppose $\eta \ll N^{-1}$. Then there are typically 0 or 1 eigenvalue within a window of size
$\eta$ around $x$. Then $\operatorname{Im} g_N$ will be affected by the fluctuations of single eigenvalues of $\mathbf{X}$,
and hence it cannot converge to any deterministic function (see Fig. 2.3).


Figure 2.2 The Cauchy kernel $K_\eta(x)$ for $\eta = 0.5$. It is strongly peaked around zero with a window width of
order $\eta$. When $\eta \to 0$, the Cauchy kernel is a possible representation of Dirac’s $\delta$-function.
Figure 2.3 Imaginary part of $g(x - i\eta)$ for the Wigner ensemble. The analytic result for $\eta \to 0^+$
is compared with numerical simulations ($N = 400$). On the left for $\eta = 1/\sqrt{N}$ and $\eta = 1$. Note
that for $\eta = 1$ the density is quite deformed. On the right (zoom near $x = 0$) for $\eta = 1/\sqrt{N}$ and
$\eta = 1/N$. Note that for $\eta = 1/N$, the density fluctuates wildly as only a small number of (random)
eigenvalues contribute to the Cauchy kernel.


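(2) Suppose $N^{-1} \ll \eta \ll 1$ (e.g. $\eta = N^{-1/2}$). Then on a small scale $\eta \ll \Delta x \ll 1$, the
density $\rho$ is locally constant and there are a great number $n$ of eigenvalues inside:

$$n \sim N\rho(x)\Delta x \gg N\eta \gg 1. \tag{2.44}$$

The law of large numbers allows us to replace the sum with an integral; we obtain that

$$\frac{1}{N}\sum_{k:\,\lambda_k\in[x-\Delta x,\,x+\Delta x]}\frac{\eta}{(x - \lambda_k)^2 + \eta^2} \to \int_{x-\Delta x}^{x+\Delta x}\frac{\rho(x)\,\eta\,\mathrm{d}y}{(x - y)^2 + \eta^2} \to \pi\rho(x), \tag{2.45}$$

where the last limit is obtained by writing $u = (y - x)/\eta$ and noting that as $\eta \to 0$
we have

$$\int_{-\infty}^{\infty}\frac{\mathrm{d}u}{u^2 + 1} = \pi. \tag{2.46}$$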

Exercise 2.3.2 Finite N approximation and small imaginary part
$\operatorname{Im} g_N(x - i\eta)/\pi$ is a good approximation to $\rho(x)$ for small positive $\eta$,
where $g_N(z)$ is the sample Stieltjes transform ($g_N(z) = (1/N)\sum_k 1/(z - \lambda_k)$).
Numerically generate a Wigner matrix of size $N$ and $\sigma^2 = 1$.
(a) For three values of $\eta$, $\{1/N, 1/\sqrt{N}, 1\}$, plot $\operatorname{Im} g_N(x - i\eta)/\pi$ and the
theoretical $\rho(x)$ on the same plot for $x$ between $-3$ and $3$.
(b) Compute the error as a function of $\eta$ where the error is $(\rho(x) - \operatorname{Im} g_N(x - i\eta)/\pi)^2$
summed for all values of $x$ between $-3$ and $3$ spaced by intervals of
0.01. Plot this error for $\eta$ between $1/N$ and $1$. You should see that $1/\sqrt{N}$ is
very close to the minimum of this function.


2.3.4 Stieltjes Inversion Formula 


From the above discussions, we extract the following important results:
(1) The Stieltjes inversion formula (also called the Sokhotski–Plemelj formula):

$$\lim_{\eta\to 0^+}\operatorname{Im} g(x - i\eta) = \pi\rho(x). \tag{2.47}$$

(2) When applied to the finite size Stieltjes transform $g_N(z)$, we should take $N^{-1} \ll \eta \ll 1$
for $g_N(x - i\eta)$ to converge to $g(x - i\eta)$ and for (2.47) to hold. Numerically, $\eta = N^{-1/2}$
works quite well.

We discuss briefly why $\eta = N^{-1/2}$ works best. First, we want $\eta$ to be as small as
possible such that the local density $\rho(x)$ is not blurred. If $\eta$ is too large, one introduces
a systematic error of order $\rho'(x)\eta$. On the other hand, we want $N\eta$ to be as large as
possible such that we include the statistics of a sufficient number of eigenvalues so that
we measure $\rho(x)$ accurately. In fact, the error between $g_N$ and $g$ is of order $\frac{1}{N\eta}$. Thus we
want to minimize the total error $\mathcal{E}$ given by

$$\mathcal{E} = \rho'(x)\eta + \frac{1}{N\eta}, \qquad \rho'(x)\eta: \text{systematic error}, \quad \frac{1}{N\eta}: \text{statistical error}. \tag{2.48}$$

Setting $\mathrm{d}\mathcal{E}/\mathrm{d}\eta = 0$ gives $\rho'(x) = 1/(N\eta^2)$, so the total error is minimized when $\eta$ is of order $1/\sqrt{N\rho'(x)}$.
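The trade-off between blurring and statistical noise can be seen directly on one simulated matrix. The sketch below (Python with NumPy) computes $\operatorname{Im} g_N(x - i\eta)/\pi$ on a grid and the summed squared deviation from the semi-circle density for $\eta \in \{1/N, 1/\sqrt{N}, 1\}$. It is a rough illustration under assumed settings ($N = 1000$, $\sigma = 1$, grid step 0.01 on $[-3, 3]$, arbitrary seed), with helper names introduced here.

```python
import numpy as np

def semicircle(x, sigma=1.0):
    """Semi-circle density, zero outside [-2 sigma, 2 sigma]."""
    return np.sqrt(np.clip(4 * sigma**2 - x**2, 0.0, None)) / (2 * np.pi * sigma**2)

def smoothed_density(lam, x, eta):
    """Im g_N(x - i eta) / pi: the empirical density convolved with the Cauchy kernel."""
    d = x[:, None] - lam[None, :]
    return np.mean(eta / (d**2 + eta**2), axis=1) / np.pi

rng = np.random.default_rng(6)   # arbitrary seed
n = 1000
h = rng.normal(0.0, 1.0 / np.sqrt(2 * n), size=(n, n))
lam = np.linalg.eigvalsh(h + h.T)          # GOE eigenvalues, sigma = 1

x = np.arange(-3.0, 3.0 + 1e-9, 0.01)
errors = {}
for label, eta in [("1/N", 1 / n), ("1/sqrt(N)", n**-0.5), ("1", 1.0)]:
    errors[label] = np.sum((semicircle(x) - smoothed_density(lam, x, eta)) ** 2)
```

With these settings the intermediate choice $\eta = 1/\sqrt{N}$ is expected to give the smallest of the three errors; scanning a finer range of $\eta$ would locate the minimum more precisely.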


2.3.5 Density of Eigenvalues of a Wigner Matrix 


We go back to study the Stieltjes transform (2.38) of the Wigner matrix. Note that for
$z = x - i\eta$ with $\eta \to 0$, $g(z)$ can only have an imaginary part if $\sqrt{x^2 - 4\sigma^2}$ is imaginary.
Then, using (2.47), we get the Wigner semi-circle law:

$$\rho(x) = \frac{1}{\pi}\lim_{\eta\to 0^+}\operatorname{Im} g(x - i\eta) = \frac{\sqrt{4\sigma^2 - x^2}}{2\pi\sigma^2}, \qquad -2\sigma < x < 2\sigma. \tag{2.49}$$

Note the following features of the semi-circle law (see Fig. 2.4): (1) asymptotically there
is no eigenvalue for $x > 2\sigma$ and $x < -2\sigma$; (2) the eigenvalue density has square-root
singularities near the edges: $\rho(x) \sim \sqrt{x + 2\sigma}$ near the left edge and $\rho(x) \sim \sqrt{2\sigma - x}$
near the right edge. For finite $N$, some eigenvalues are present in a small region of width
$N^{-2/3}$ around the edges, see Section 14.1.
Figure 2.4 Density of eigenvalues of a Wigner matrix with $\sigma = 1$: the semi-circle law.


Exercise 2.3.3 From the moments to the density
A large random matrix has moments $\tau(\mathbf{A}^k) = 1/k$.
(a) Using Eq. (2.24) write the Taylor series of $g(z)$ around infinity.
(b) Sum the series to get a simple expression for $g(z)$. Hint: look up the Taylor
series of $\log(1 + x)$.
(c) Where are the singularities of $g(z)$ on the real axis?
(d) Use Eq. (2.47) to find the density of eigenvalues $\rho(\lambda)$.
(e) Check your result by recomputing the moments and the Stieltjes transform
from $\rho(\lambda)$.
(f) Redo all the above steps for a matrix whose odd moments are zero and even
moments are $\tau(\mathbf{A}^{2k}) = 1$. Note that in this case the density $\rho(\lambda)$ has Dirac
masses.


2.3.6 Stieltjes Transform on the Real Axis 


What about computing the Stieltjes transform when $z$ is real and inside the spectrum?
This seems dangerous at first sight, since $g_N(z)$ diverges when $z$ is equal to one of the
eigenvalues of $\mathbf{X}$. As these eigenvalues become more and more numerous as $N$ goes to
infinity, $z$ will always be very close to a pole of the resolvent $g_N(z)$. Interestingly, one can
turn this predicament on its head and actually exploit these divergences. In a hand-waving
manner, the probability that the difference $d_i = |z - \lambda_i|$ between $z$ (now real) and a given
eigenvalue $\lambda_i$ is very small, is given by

$$\mathbb{P}[d_i < \epsilon/N] = 2\epsilon\rho(z), \tag{2.50}$$

where $\rho(z)$ is the normalized density of eigenvalues around $z$. But as $\epsilon \to 0$, i.e. when
$z$ is extremely close to $\lambda_i$, the resolvent becomes dominated by a unique contribution,
that of the $\lambda_i$ term; all other terms $(z - \lambda_j)^{-1}$, $j \neq i$, become negligible. In other words,
$g_N(z) \approx \pm(N d_i)^{-1}$, and therefore

$$\mathbb{P}[|g| > \epsilon^{-1}] = \mathbb{P}[d_i < \epsilon/N] = 2\epsilon\rho(z). \tag{2.51}$$

Hence, $g_N(z)$ does not converge for $N \to \infty$ when $z$ is real, but the tail of its distribution
decays precisely as $\rho(z)/g^2$. Studying this tail thus allows one to extract the eigenvalue
density $\rho(z)$ while working directly on the real axis. Let us show how this works in the
case of Wigner matrices.

The idea is that, for a rotationally invariant problem, the distribution of a randomly
chosen diagonal element of the resolvent (say $G_{11}$) is the same as the distribution $P(g)$
of the normalized trace.³ With this assumption, Eq. (2.29) can be interpreted as giving the
evolution of $P(g)$ itself, i.e.

$$P^{(N)}(g) = \int_{-\infty}^{+\infty}\mathrm{d}g'\,P^{(N-1)}(g')\,\delta\!\left(g - \frac{1}{z - \sigma^2 g'}\right), \tag{2.52}$$

where we have used the fact that, for large $N$, $\sum_{k,\ell=2}^{N} M_{1k}(\mathbf{M}_{22})^{-1}_{k\ell}M_{\ell 1} \to \sigma^2 g^{(N-1)}$.
Now, this functional iteration admits the following Cauchy distribution as a fixed point:

$$P^{\infty}(g) = \frac{\rho(z)}{\bigl(g - \frac{z}{2\sigma^2}\bigr)^2 + \pi^2\rho(z)^2}. \tag{2.53}$$

This simple result, that the resolvent of a Wigner matrix on the real axis is a Cauchy variable,
calls for several comments. First, one finds that $P^{\infty}(g)$ indeed behaves as $\rho(z)/g^2$
for large g, as argued above. Second, it would have been entirely natural to find a Cauchy 
distribution for g had the eigenvalues been independent. Indeed, since g is then the sum 
of N random variables (i.e. the 1/d;’s) distributed with an inverse square power, the 
generalized CLT predicts that the resulting sum is Cauchy distributed. In the present case, 
however, the eigenvalues are strongly correlated — see Section 5.1.4. It was recently proven 
that the Cauchy distribution is in fact super-universal and holds for a wide class of point 
processes on the real axis, in particular for the eigenvalues of random matrices. It is in fact 
even true when these eigenvalues are strictly equidistant, with a random global shift. 
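The fixed-point property of the Cauchy distribution (2.53) under the map $g' \mapsto 1/(z - \sigma^2 g')$ of Eq. (2.52) can be checked by direct sampling. The sketch below (Python with NumPy) draws Cauchy samples with center $z/(2\sigma^2)$ and width $\pi\rho(z)$, applies one step of the map, and re-estimates center and width from the quartiles, which for a Cauchy law sit at center $\pm$ width. The values $\sigma = 1$, $z = 0.8$, the sample size and the seed are assumptions made for this illustration.

```python
import numpy as np

sigma, z = 1.0, 0.8                       # real z inside (-2 sigma, 2 sigma)
rho = np.sqrt(4 * sigma**2 - z**2) / (2 * np.pi * sigma**2)
center, width = z / (2 * sigma**2), np.pi * rho

rng = np.random.default_rng(7)            # arbitrary seed
g_old = center + width * rng.standard_cauchy(200_000)   # samples from the candidate fixed point
g_new = 1.0 / (z - sigma**2 * g_old)                     # one step of the iteration

# Quartiles of a Cauchy law are center - width, center, center + width
q25, q50, q75 = np.percentile(g_new, [25, 50, 75])
est_center, est_width = q50, 0.5 * (q75 - q25)
```

Quartiles are used rather than sample means and variances because the Cauchy distribution has no finite moments.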
More on Gaussian Matrices*


In the previous chapter, we dealt with the simplest of all Gaussian matrix ensembles, where
entries are real, Gaussian random variables, and global symmetry is imposed. It was pointed
out by Dyson that there exist precisely three division rings that contain the real numbers,
namely, the reals themselves, the complex numbers and the quaternions. He showed that this
fact implies that there are only three acceptable ensembles of Gaussian random matrices
with real eigenvalues: GOE, GUE and GSE. Each is associated with a Dyson index called $\beta$
(1, 2 and 4, respectively) and except for this difference in $\beta$ almost all of the results in
this book (and many more) apply to the three ensembles. In particular their moments and
eigenvalue density are the same as $N \to \infty$, while correlations and deviations from the
asymptotic formulas follow families of laws with $\beta$ as a parameter. In this chapter we will
review the other two ensembles (GUE and GSE),¹ and also discuss the general moments of
Gaussian random matrices, for which some interesting mathematical tools are available,
that are useful beyond RMT.


3.1 Other Gaussian Ensembles 
3.1.1 Complex Hermitian Matrices 


For matrices with complex entries, the analog of a symmetric matrix is a (complex) Hermitian
matrix. It satisfies $\mathbf{A}^\dagger = \mathbf{A}$ where the dagger operator is the combination of matrix
transposition and complex conjugation. There are two important reasons to study com- 
plex Hermitian matrices. First they appear in many applications, especially in quantum 
mechanics. There, the energy and other observables are mapped into Hermitian operators, 
or Hermitian matrices for systems with a finite number of states. The first large N result 
of random matrix theory is the Wigner semi-circle law. As recalled in the introduction to 
Chapter 2, it was obtained by Wigner as he modeled the energy levels of complex heavy 
nuclei as a random Hermitian matrix. 


¹ More recently, it was shown how ensembles with an arbitrary value of $\beta$ can be constructed, see Dumitriu and Edelman
[2002], Allez et al. [2012].
The other reason Hermitian matrices are important is mathematical. In the large N limit,
the three ensembles (real, complex and quaternionic (see below)) behave the same way. 
But for finite N, computations and proofs are much simpler in the complex case. The main 
reason is that the Vandermonde determinant which we will introduce in Section 5.1.4 
is easier to manipulate in the complex case. For this reason, most mathematicians 
discuss the complex Hermitian case first and treat the real and quaternionic cases as 
extensions. In this book we want to stay close to applications in data science and statistical 
physics, so we will discuss complex matrices only in the present chapter. In the rest of 
the book we will indicate in footnotes how to extend the result to complex Hermitian 
matrices. 

A complex Hermitian matrix A has real eigenvalues and it can be diagonalized with
a suitable unitary matrix U. A unitary matrix satisfies $U^\dagger U = \mathbf{1}$. So A can be written as
$A = U \Lambda U^\dagger$, with $\Lambda$ the diagonal matrix containing its N eigenvalues.

We want to build the complex Wigner matrix: a Hermitian matrix with IID Gaussian
entries. We will choose a construction that has unitary invariance for every N. Let us study
the unitary invariance of complex Gaussian vectors. First we need to define a complex
Gaussian variable.

We say that the complex variable z is centered Gaussian with variance $\sigma^2$ if $z = x_r + i x_i$,
where $x_r$ and $x_i$ are independent centered Gaussian variables of variance $\sigma^2/2$. We have

$$\mathbb{E}[|z|^2] = \mathbb{E}[x_r^2] + \mathbb{E}[x_i^2] = \sigma^2. \tag{3.1}$$

A white complex Gaussian vector x is a vector whose components are IID complex centered
Gaussians. Consider y = Ux where U is a unitary matrix. Each of the components is a linear
combination of Gaussian variables so y is Gaussian. It is relatively straightforward to show
that each component has the same variance $\sigma^2$ and that there is no covariance between
different components. Hence y is also a white Gaussian vector. The ensemble of white
complex Gaussian vectors is invariant under unitary transformations.
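
The invariance argument above can be checked numerically. The following is a minimal sketch, assuming NumPy; the dimension, sample size, seed and the way the unitary matrix is generated (QR decomposition of a complex Gaussian matrix) are our own illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(8)
dim, samples, sigma = 4, 100_000, 1.0

# White complex Gaussian vectors: real and imaginary parts of variance sigma^2 / 2.
x = (rng.normal(scale=sigma / np.sqrt(2), size=(dim, samples))
     + 1j * rng.normal(scale=sigma / np.sqrt(2), size=(dim, samples)))

# A unitary matrix obtained from the QR decomposition of a complex Gaussian matrix.
a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
u, _ = np.linalg.qr(a)

y = u @ x
cov_y = y @ y.conj().T / samples     # empirical E[y y^dagger]
print(np.round(cov_y, 2))            # close to sigma^2 times the identity
```

Up to sampling noise of order 1/sqrt(samples), the empirical covariance of y is the same multiple of the identity as that of x.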

To define the Hermitian Wigner matrix, we first define a (non-symmetric) square
matrix H whose entries are centered complex Gaussian numbers and let X be the Hermitian
matrix defined by

$$X = H + H^\dagger. \tag{3.2}$$

If we repeat the arguments of Section 2.2.2, we can show that the ensemble of X is invariant
under unitary transformation: $U X U^\dagger \overset{\text{in law}}{=} X$.

We did not specify the variance of the elements of H. We would like X to be normalized
as $\tau(X^2) = \sigma^2 + O(1/N)$. Choosing the variance of the elements of H as $\mathbb{E}[|H_{ij}|^2] = \sigma^2/(2N)$ achieves
precisely that.

The Hermitian matrix X has real diagonal elements with $\mathbb{E}[X_{ii}^2] = \sigma^2/N$ and off-diagonal
elements that are complex Gaussian with $\mathbb{E}[|X_{ij}|^2] = \sigma^2/N$. In other words the real and
imaginary parts of the off-diagonal elements of X have variance $\sigma^2/(2N)$. We can put
all this information together in the joint law of the matrix elements of the Hermitian
matrix X:

$$P(\{X_{ij}\}) \propto \exp\left[-\frac{N}{2\sigma^2} \operatorname{Tr} X^2\right]. \tag{3.3}$$


This law is identical to the real symmetric case (Eq. 2.17) up to a factor of 2. We can then 
write both the symmetric and the Hermitian case as 


$$P(\{X_{ij}\}) \propto \exp\left[-\frac{\beta N}{4\sigma^2} \operatorname{Tr} X^2\right], \tag{3.4}$$

where $\beta$ is 1 or 2 respectively.

The complex Hermitian Wigner ensemble is called the Gaussian unitary ensemble 
or GUE. 
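
As a numerical illustration of this construction, the sketch below (assuming NumPy) draws X = H + H† with $\mathbb{E}[|H_{ij}|^2] = \sigma^2/(2N)$, which is 1/(2N) for σ = 1, and looks at the normalized trace of X² and at the edges of the spectrum. The helper name `sample_gue`, the matrix size and the seed are hypothetical choices made for this example.

```python
import numpy as np

def sample_gue(n, sigma=1.0, rng=None):
    """Hypothetical helper: draw X = H + H^dagger with E|H_ij|^2 = sigma^2 / (2 n)."""
    rng = np.random.default_rng() if rng is None else rng
    # Real and imaginary parts each carry half of the variance sigma^2 / (2 n).
    scale = sigma / np.sqrt(4 * n)
    h = rng.normal(scale=scale, size=(n, n)) + 1j * rng.normal(scale=scale, size=(n, n))
    return h + h.conj().T

rng = np.random.default_rng(0)
n = 400
x = sample_gue(n, sigma=1.0, rng=rng)
eigs = np.linalg.eigvalsh(x)             # real eigenvalues of a Hermitian matrix
tau_x2 = np.mean(eigs**2)                # normalized trace tau(X^2)
print(tau_x2, eigs.min(), eigs.max())    # roughly sigma^2, -2 sigma and 2 sigma
```

For N of a few hundred the second moment is already within a few percent of σ², and the extreme eigenvalues sit close to ±2σ.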

The results of the previous chapter apply equally to the real symmetric and the
complex Hermitian case. Both the self-consistent equation for the Stieltjes transform
and the counting of non-crossing pair partitions (see next section) rely on the independence
of the elements of the matrix and on the fact that $\mathbb{E}[|X_{ij}|^2] = \sigma^2/N$, true in both cases.
We then have that the Stieltjes transform of the two ensembles is the same and they have
exactly the same semi-circle distribution of eigenvalues in the large N limit. The same will
be true for the quaternionic case ($\beta = 4$) in the next section, and in fact for all values of $\beta$
provided $N\beta \to \infty$ when $N \to \infty$, see Section 5.3.1:

$$\rho_\beta(\lambda) = \frac{\sqrt{4\sigma^2 - \lambda^2}}{2\pi\sigma^2}, \qquad -2\sigma \le \lambda \le 2\sigma. \tag{3.5}$$


3.1.2 Quaternionic Hermitian Matrices 


We will define here the quaternionic Hermitian matrices and the GSE. There are many
fewer applications of quaternionic matrices than of the more common real or complex matrices.
We include this discussion here for completeness. In the literature the link between
symplectic matrices and quaternions can be quite obscure for the novice reader. Except for
the existence of an ensemble of matrices with $\beta = 4$ we will never refer to quaternionic
matrices after this section, which can safely be skipped.

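Quaternions are non-commutative extensions of the real and complex numbers. They
are written as real linear combinations of the real number 1 and three abstract non-commuting
objects ($\mathbf{i}$, $\mathbf{j}$, $\mathbf{k}$) satisfying

$$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1 \;\Rightarrow\; \mathbf{i}\mathbf{j} = -\mathbf{j}\mathbf{i} = \mathbf{k}, \quad \mathbf{j}\mathbf{k} = -\mathbf{k}\mathbf{j} = \mathbf{i}, \quad \mathbf{k}\mathbf{i} = -\mathbf{i}\mathbf{k} = \mathbf{j}. \tag{3.6}$$

So we can write a quaternion as $h = x_r + \mathbf{i}\,x_i + \mathbf{j}\,x_j + \mathbf{k}\,x_k$. If only $x_r$ is non-zero we say that
h is real. We define the quaternionic conjugation as $1^* = 1$, $\mathbf{i}^* = -\mathbf{i}$, $\mathbf{j}^* = -\mathbf{j}$, $\mathbf{k}^* = -\mathbf{k}$,
so that the norm $|h|^2 := h h^* = x_r^2 + x_i^2 + x_j^2 + x_k^2$ is always real and non-negative. The
abstract objects $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ can be represented as 2 × 2 complex matrices:

$$\mathbf{1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mathbf{i} = \begin{pmatrix} i & 0 \\ 0 & -i \end{pmatrix}, \quad \mathbf{j} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \quad \mathbf{k} = \begin{pmatrix} 0 & i \\ i & 0 \end{pmatrix}, \tag{3.7}$$

where the i in the matrices is now the usual unit imaginary number.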
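
The multiplication rules of Eq. (3.6) can be verified directly on the 2 × 2 matrices of Eq. (3.7). The sketch below assumes NumPy; the variable names (`qi`, `qj`, `qk`) and the helper `rep` are our own, introduced only for this check.

```python
import numpy as np

# 2 x 2 complex representation of the unit quaternions, as in Eq. (3.7).
one = np.eye(2, dtype=complex)
qi = np.array([[1j, 0], [0, -1j]])
qj = np.array([[0, 1], [-1, 0]], dtype=complex)
qk = np.array([[0, 1j], [1j, 0]])

def rep(xr, xi, xj, xk):
    """Matrix representing the quaternion h = xr + i xi + j xj + k xk."""
    return xr * one + xi * qi + xj * qj + xk * qk

h = rep(1.0, -2.0, 0.5, 3.0)
# Quaternionic conjugation corresponds to the Hermitian conjugate of the matrix,
# so h h^* is |h|^2 times the identity.
norm_sq = (h @ h.conj().T)[0, 0].real
print(norm_sq)   # 1 + 4 + 0.25 + 9 = 14.25
```

In this representation the quaternionic conjugate is simply the conjugate transpose, which is why the norm comes out real and non-negative.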

Quaternions share all the algebraic properties of real and complex numbers except for 
commutativity (they form a division ring). Since matrices in general do not commute, 
matrices built out of quaternions behave like real or complex matrices. 

A Hermitian quaternionic matrix is a square matrix A whose elements are quaternions
and that satisfies $A = A^\dagger$. Here the dagger operator is the combination of matrix transposition
and quaternionic conjugation. Such matrices are diagonalizable and their eigenvalues are real.
Matrices that diagonalize Hermitian quaternionic matrices are called symplectic. Written
in terms of quaternions they satisfy $S S^\dagger = \mathbf{1}$.

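Given the representation of quaternions as 2 × 2 complex matrices, an N × N quaternionic
Hermitian matrix A can be written as a 2N × 2N complex matrix Q(A). We choose a
representation where

$$Z := Q(\mathbf{j}\,\mathbf{1}) = \begin{pmatrix} 0 & \mathbf{1} \\ -\mathbf{1} & 0 \end{pmatrix}, \tag{3.8}$$

with blocks of size N × N.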


For a 2N × 2N complex matrix Q to be the representation of a quaternionic Hermitian
matrix it has to have two properties. First, quaternionic conjugation acts just like Hermitian
conjugation so $Q^\dagger = Q$. Second, it has to be expressible as a real linear combination of
unit quaternions. One can show that such matrices (and only them) satisfy

$$Q^R := Z Q^T Z^{-1} = Q^\dagger, \tag{3.9}$$

where $Q^R$ is called the dual of Q. In other words an N × N Hermitian quaternionic matrix
corresponds to a 2N × 2N self-dual Hermitian matrix (i.e. $Q = Q^\dagger = Q^R$). In this
2N × 2N representation symplectic matrices are complex matrices satisfying

$$S S^\dagger = S S^R = \mathbf{1}. \tag{3.10}$$


To recap, a 2N × 2N Hermitian self-dual matrix Q can be diagonalized by a symplectic
matrix S. Its 2N eigenvalues are real and they occur in pairs as they are the N eigenvalues 
of the equivalent Hermitian quaternionic N x N matrix. 
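
The pairing of eigenvalues can be seen on a small random example. The sketch below assumes NumPy and uses one concrete block layout consistent with the 2 × 2 matrices of Eq. (3.7), with Z as the block matrix [[0, 1], [-1, 0]]; the helper name `quaternionic_hermitian`, the size and the seed are hypothetical, and no particular normalization of the entries is implied.

```python
import numpy as np

def quaternionic_hermitian(n, rng):
    """Hypothetical helper: 2n x 2n complex representation of a random
    n x n Hermitian quaternionic matrix (no particular normalization)."""
    a_r = rng.normal(size=(n, n))
    a_r = a_r + a_r.T                      # real part: symmetric
    parts = []
    for _ in range(3):                     # i, j, k parts: antisymmetric
        m = rng.normal(size=(n, n))
        parts.append(m - m.T)
    a_i, a_j, a_k = parts
    # Block form obtained from the 2 x 2 matrices of Eq. (3.7).
    return np.block([[a_r + 1j * a_i, a_j + 1j * a_k],
                     [-a_j + 1j * a_k, a_r - 1j * a_i]])

rng = np.random.default_rng(1)
n = 5
q = quaternionic_hermitian(n, rng)
z = np.kron(np.array([[0.0, 1.0], [-1.0, 0.0]]), np.eye(n))   # block matrix Z
q_dual = z @ q.T @ np.linalg.inv(z)                            # dual Q^R
eigs = np.linalg.eigvalsh(q)                                   # sorted real eigenvalues
print(np.round(eigs, 4))                                       # each value appears twice
```

The matrix is Hermitian and equal to its dual, and its 2N eigenvalues come in N degenerate pairs.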

We can now define the third Gaussian matrix ensemble, namely the Gaussian symplectic
ensemble (GSE), consisting of Hermitian quaternionic matrices whose off-diagonal elements
are quaternions with Gaussian distribution of zero mean and variance $\mathbb{E}[|X_{ij}|^2] = \sigma^2/N$.
This means that each of the four components of each $X_{ij}$ is a Gaussian number of
zero mean and variance $\sigma^2/(4N)$. The diagonal elements of X are real Gaussian numbers
with zero mean and variance $\sigma^2/(2N)$. As usual $X_{ji} = X_{ij}^*$, so only the upper (or lower)
triangular elements are independent. The joint law for the elements of a GSE matrix with
variance $\tau(X^2) = \sigma^2$ is given by

$$P(\{X_{ij}\}) \propto \exp\left[-\frac{N}{\sigma^2} \operatorname{Tr} X^2\right], \tag{3.11}$$

which we identify with Eq. (3.4) with $\beta = 4$. This parameter $\beta = 4$ is a fundamental property
of the symplectic group and will consistently appear in contrast with the orthogonal
and unitary cases, $\beta = 1$ and $\beta = 2$ (see Section 5.1.4).

The parameter $\beta$ can be interpreted as the randomness in the norm of the matrix
elements. More precisely, we have

$$|X_{ij}|^2 = \begin{cases} x_r^2 & \text{for real symmetric,} \\ x_r^2 + x_i^2 & \text{for complex Hermitian,} \\ x_r^2 + x_i^2 + x_j^2 + x_k^2 & \text{for quaternionic Hermitian,} \end{cases} \tag{3.12}$$

where $x_r, x_i, x_j, x_k$ are real Gaussian numbers such that $\mathbb{E}[|X_{ij}|^2] = 1$. We see that the
fluctuations of $|X_{ij}|^2$ decrease with $\beta$ (precisely $\mathbb{V}[|X_{ij}|^2] = 2/\beta$). By the law of large
numbers (LLN), in the $\beta \to \infty$ limit (if such an ensemble existed) we would have
$|X_{ij}|^2 = 1$ with no fluctuations.
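
The statement that the variance of the squared norm equals 2/β is easy to check by simulation. The following minimal sketch assumes NumPy; the sample size and seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = 200_000
variances = {}
for beta in (1, 2, 4):
    # beta real Gaussian components, each of variance 1/beta, so that E|X|^2 = 1.
    comps = rng.normal(scale=np.sqrt(1.0 / beta), size=(samples, beta))
    norm_sq = np.sum(comps**2, axis=1)
    variances[beta] = norm_sq.var()
    print(beta, norm_sq.mean(), variances[beta])   # variance close to 2 / beta
```

The squared norm is a rescaled chi-squared variable with β degrees of freedom, which is where the 2/β comes from.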




Exercise 3.1.1 Quaternionic matrices of size one
The four matrices in Eq. (3.7) can be thought of as the 2 × 2 complex
representations of the four unit quaternions.


(a) Define $Z := \mathbf{j}$ and compute $Z^{-1}$.

(b) Show that for all four matrices Q, we have $Z Q^T Z^{-1} = Q^\dagger$, where the dagger
here is the usual transpose plus complex conjugation.

(c) Convince yourself that, by linearity, any Q that is a real linear combination of
the 2 × 2 matrices $\mathbf{i}$, $\mathbf{j}$, $\mathbf{k}$ and $\mathbf{1}$ must satisfy $Z Q^T Z^{-1} = Q^\dagger$.

(d) Give an example of a matrix Q that does not satisfy $Z Q^T Z^{-1} = Q^\dagger$.


3.1.3 The Ginibre Ensemble 


The Gaussian orthogonal ensemble is such that all matrix elements of X are IID Gaussian,
but with the strong constraint that $X_{ij} = X_{ji}$, which makes sure that all eigenvalues of
X are real. What happens if we drop this constraint and consider a square matrix H with 
independent entries? In this case, one may choose two different routes, depending on the 
context. 


• One route is simply to allow eigenvalues to be complex numbers. One can then study
the eigenvalue distribution in the complex plane, so the distribution becomes a two-dimensional
density. Some of the tools introduced in the previous chapter, such as the
Sokhotski–Plemelj formula, can be generalized to complex eigenvalues. The final result
is called the Girko circular law: the density of eigenvalues is constant within a disk
centered at zero and of radius $\sigma$ (see Fig. 3.1). In the general case where $\mathbb{E}[H_{ij} H_{ji}] = \rho\sigma^2$,
the eigenvalues are confined within an ellipse of half-width $(1 + \rho)\sigma$ along the
real axis and $(1 - \rho)\sigma$ in the imaginary direction, interpolating between a circle for
$\rho = 0$ (independent entries) and a line segment on the real axis of length $4\sigma$ for
$\rho = 1$ (symmetric matrices).

• The other route is to focus on the singular values of H. One should thus study the real
eigenvalues of $H^T H$ when H is a square random matrix made of independent Gaussian
elements. This is precisely the Wishart problem that we will study in Chapter 4, for the
special parameter value q = 1. Calling s the square-root of these real eigenvalues, the
final result is a quarter-circle:

$$\rho(s) = \frac{\sqrt{4\sigma^2 - s^2}}{\pi\sigma^2}, \qquad s \in (0, 2\sigma). \tag{3.13}$$
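
Both routes can be explored on a single random matrix. The sketch below assumes NumPy and takes IID real Gaussian entries of variance σ²/N, so that the disk has radius σ and the quarter-circle ends at 2σ; the size and seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 300, 1.0
h = rng.normal(scale=sigma / np.sqrt(n), size=(n, n))   # IID entries of variance sigma^2 / n

lam = np.linalg.eigvals(h)                  # complex eigenvalues (first route)
s = np.linalg.svd(h, compute_uv=False)      # singular values = sqrt(eig(H^T H)) (second route)

print(np.abs(lam).max())     # close to sigma, the edge of the disk
print(s.max())               # close to 2 sigma, the edge of the quarter-circle
print(np.mean(s**2))         # close to sigma^2, the second moment of Eq. (3.13)
```

A scatter plot of `lam` in the complex plane and a histogram of `s` give the disk and the quarter-circle, respectively.
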

Figure 3.1 Complex eigenvalues of a random N = 1000 matrix taken from the Gaussian Ginibre
ensemble, i.e. a non-symmetric matrix with IID Gaussian elements with variance $\sigma^2 = 1/N$.
The dashed line corresponds to the circle $|\lambda|^2 = 1$. As $N \to \infty$ the density becomes uniform in
the complex unit disk. This distribution is called the circle law or sometimes, more accurately, the
disk law.


Exercise 3.1.2 Three quarter-circle laws
Let H be a (non-symmetric) square matrix of size N whose entries are IID
Gaussian random variables of variance $\sigma^2/N$. Then, as a simple consequence
of the above discussion, the following three sets of numbers are distributed
according to the quarter-circle law (3.13) in the large N limit. Define

$r_i = 2|\operatorname{Re} \lambda_i|$ where $\{\lambda_i\}$ are the eigenvalues of H,

$w_i = |\lambda_i|$ where $\{\lambda_i\}$ are the eigenvalues of $(H + H^T)/\sqrt{2}$,

$s_i = \sqrt{\lambda_i}$ where $\{\lambda_i\}$ are the eigenvalues of $H H^T$.


(a) Generate a large matrix H with say N = 1000 and $\sigma^2 = 1$ and plot the
histogram of the three above sets.

(b) Although these three sets of numbers converge to the same distribution there
is no simple relation between them. In particular they are not equal. For a
moderate N (10 or 20) examine the three sets and realize that they are all
different.


3.2 Moments and Non-Crossing Pair Partitions
3.2.1 Fourth Moment of a Wigner Matrix 


We have stated in Section 2.2.1 that for a Wigner matrix we have $\tau(X^4) = 2\sigma^4$. We will
now compute this fourth moment directly and then develop a technique to compute all other
moments.
We have

$$\tau(X^4) = \frac{1}{N} \mathbb{E}[\operatorname{Tr}(X^4)] = \frac{1}{N} \sum_{i,j,k,l} \mathbb{E}[X_{ij} X_{jk} X_{kl} X_{li}]. \tag{3.14}$$

Recall that $(X_{ij} : 1 \le i \le j \le N)$ are independent Gaussian random variables of mean
zero. So for the expectations in the above sum to be non-zero, each X entry needs to be
equal to another X entry.² There are two possibilities. Either all four are equal or they are
equal pairwise. In the following we will not distinguish between diagonal and off-diagonal
terms; as there are many more off-diagonal terms these terms always dominate.

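(1) If $X_{ij} = X_{jk} = X_{kl} = X_{li}$, then

$$\mathbb{E}[X_{ij} X_{jk} X_{kl} X_{li}] = \frac{3\sigma^4}{N^2}, \tag{3.15}$$

and there are $N^2$ of them. Thus the total contribution from these terms is

$$\frac{1}{N}\, N^2\, \frac{3\sigma^4}{N^2} = \frac{3\sigma^4}{N} \to 0. \tag{3.16}$$
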
(2) Suppose there are two different pairs. Then there are three possibilities (see Fig. 3.2):

(i) $X_{ij} = X_{jk}$, $X_{kl} = X_{li}$, and $X_{ij}$ is different from $X_{kl}$ (i.e. $j \neq l$). Then

$$(3.14) = \frac{1}{N} \sum_{i,\, j \neq l} \mathbb{E}[X_{ij}^2 X_{il}^2] = \frac{1}{N} (N^3 - N^2) \left(\frac{\sigma^2}{N}\right)^2 \to \sigma^4 \tag{3.17}$$

as $N \to \infty$.


Figure 3.2 Graphical representation of the three terms contributing to $\tau(X^4)$. The last one is a
crossing partition and has a zero contribution.


2 When we say that $X_{ij} = X_{kl}$, we mean that they are the same random variable; given that X is a symmetric matrix it means
either (i = k and j = l) or (i = l and j = k).

(ii) $X_{ij} = X_{li}$, $X_{jk} = X_{kl}$, and $X_{ij}$ is different from $X_{jk}$ (i.e. $i \neq k$). Then

$$(3.14) = \frac{1}{N} \sum_{j,\, i \neq k} \mathbb{E}[X_{ij}^2 X_{jk}^2] = \frac{1}{N} (N^3 - N^2) \left(\frac{\sigma^2}{N}\right)^2 \to \sigma^4 \tag{3.18}$$

as $N \to \infty$.

(iii) $X_{ij} = X_{kl}$, $X_{jk} = X_{li}$, and $X_{ij}$ is different from $X_{jk}$ (i.e. $i \neq k$). Then we must have
i = l and j = k from $X_{ij} = X_{kl}$, and i = j and k = l from $X_{jk} = X_{li}$. This gives a
contradiction: there are no such terms.


In sum, we obtain that

$$\tau(X^4) \to \sigma^4 + \sigma^4 = 2\sigma^4 \tag{3.19}$$

as $N \to \infty$, where the two terms come from the two non-crossing partitions, see
Figure 3.2.
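
The limit can be compared with a direct simulation. The sketch below assumes NumPy and builds a real symmetric Wigner matrix with off-diagonal variance σ²/N; the helper name `sample_goe`, the size, the value of σ and the seed are hypothetical choices for this example.

```python
import numpy as np

def sample_goe(n, sigma, rng):
    """Hypothetical helper: real symmetric Wigner matrix with off-diagonal
    variance sigma^2 / n (diagonal variance 2 sigma^2 / n)."""
    h = rng.normal(scale=sigma / np.sqrt(2 * n), size=(n, n))
    return h + h.T

rng = np.random.default_rng(4)
n, sigma = 400, 1.5
x = sample_goe(n, sigma, rng)
eigs = np.linalg.eigvalsh(x)
tau_x4 = np.mean(eigs**4)        # normalized trace of X^4
print(tau_x4, 2 * sigma**4)      # agree up to O(1/N) corrections
```

The small excess over 2σ⁴ at finite N comes from the terms dropped above, such as case (1) and the larger diagonal variance.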

In the next technical section, we generalize this calculation to arbitrary moments of
X. Odd moments are zero by symmetry. Even moments $\tau(X^{2k})$ can be written as sums
over non-crossing diagrams (non-crossing pair partitions of 2k elements), where each such
diagram contributes $\sigma^{2k}$. So

$$\tau(X^{2k}) = C_k \sigma^{2k}, \tag{3.20}$$

where $C_k$ are the Catalan numbers, the number of such non-crossing diagrams. They satisfy

$$C_k = \sum_{j=1}^{k} C_{j-1} C_{k-j} = \sum_{j=0}^{k-1} C_j C_{k-j-1}, \tag{3.21}$$

with $C_0 = C_1 = 1$, and can be written explicitly as

$$C_k = \frac{1}{k+1} \binom{2k}{k}; \tag{3.22}$$

see Section 3.2.3.
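
The recursion and the closed form for the Catalan numbers can be compared in a few lines. This is a minimal sketch using only the Python standard library; the function names are our own.

```python
from math import comb

def catalan_recursive(kmax):
    """Catalan numbers C_0..C_kmax from the recursion C_k = sum_j C_j C_{k-j-1}."""
    c = [1]
    for k in range(1, kmax + 1):
        c.append(sum(c[j] * c[k - j - 1] for j in range(k)))
    return c

def catalan_closed(k):
    """Closed form C_k = binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

print(catalan_recursive(6))                     # [1, 1, 2, 5, 14, 42, 132]
print([catalan_closed(k) for k in range(7)])    # same list
```

With σ = 1 these are the even moments of the semi-circle law, for instance 2 for the fourth moment computed above.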


3.2.2 Catalan Numbers: Counting Non-Crossing Pair Partitions 


We would like to calculate all moments of X. As written above, all the odd moments
$\tau(X^{2k+1})$ vanish (since the odd moments of a Gaussian random variable vanish). We only
need to compute the even moments:

$$\tau(X^{2k}) = \frac{1}{N} \mathbb{E}[\operatorname{Tr}(X^{2k})] = \frac{1}{N} \sum_{i_1, \ldots, i_{2k}} \mathbb{E}[X_{i_1 i_2} X_{i_2 i_3} \cdots X_{i_{2k} i_1}]. \tag{3.23}$$

Since we assume that the elements of X are Gaussian, we can expand the above
expectation value with Wick’s theorem, using the covariance of the $\{X_{ij}\}$’s. The matrix
X is symmetric, so we have to keep track of the fact that $X_{ij}$ is the same variable as $X_{ji}$.
For this reason, using Wick’s theorem proves quite tedious and we will not follow this
route here.

From the Taylor series at infinity of the Stieltjes transform, we expect every even
moment of X to converge to an O(1) number as N — oo. We will therefore drop any 
O(1/N) or smaller term as we proceed. In particular the difference of variance between 
diagonal and off-diagonal elements of X does not matter to first order in 1/N. 

In Eq. (3.23), each X entry must be equal to at least one other X entry, otherwise the
expectation is zero. On the other hand, it is easy to show that for the partitions that contain
at least one group with > 2 (actually ≥ 4) X entries that are equal to each other, their total
contribution will be of order O(1/N) or smaller (e.g. in case (1) of the previous section).
Thus we only need to consider the cases where each X entry is paired to exactly one other
X entry, which we refer to as a pair partition.

We need to count the number of types of pairings of 2k elements that contribute to
$\tau(X^{2k})$ as $N \to \infty$. We associate to each pairing a diagram. For example, for k = 3, we
have 5!! = 5 · 3 · 1 = 15 possible pairings (see Fig. 3.3).

To compute the contribution of each of these pair partitions, we will compute the 
contribution of non-crossing pair partitions and argue that pair partitions with crossings 
do not contribute in the large N limit. First we need to define what is a non-crossing pair 
partition of 2k elements. A pair partition can be drawn as a diagram where the 2k elements
are points on a line and each point is joined with its pair partner by an arc drawn above that 
line. If at least two arcs cross each other the partition is called crossing, and non-crossing 
otherwise. In Figure 3.3 the five partitions on the left are non-crossing while the ten others 
are crossing. 
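
The counts quoted above can be reproduced by brute-force enumeration. The sketch below is plain Python; the function names `pairings` and `is_crossing` are our own, and the crossing test encodes the arc picture: two arcs (a, b) and (c, d) cross when a < c < b < d.

```python
def pairings(points):
    """Yield all pair partitions of the (even-length) list `points`."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for sub in pairings(remaining):
            yield [(first, partner)] + sub

def is_crossing(pairing):
    """Two arcs (a, b) and (c, d) cross when a < c < b < d."""
    return any(a < c < b < d for (a, b) in pairing for (c, d) in pairing)

all_pairings = list(pairings(list(range(6))))            # 2k = 6 points
non_crossing = [p for p in all_pairings if not is_crossing(p)]
print(len(all_pairings), len(non_crossing))              # 15 5
```

Running the same code on eight points gives 105 pairings, of which 14 are non-crossing.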

In a non-crossing partition of size 2k, there is always at least one pairing between
consecutive points (the smallest arc). If we remove the first such pairing we get a non-crossing
pair partition of 2k − 2 elements. We can proceed in this way until we get to
a pairing of only two elements: the unique (non-crossing) pair partition contributing to
(Fig. 3.4)

$$\tau(X^2) = \sigma^2. \tag{3.24}$$


Figure 3.3 Graphical representation of the 15 terms contributing to $\tau(X^6)$. Only the five on the left
are non-crossing and have a non-zero contribution as $N \to \infty$.


Figure 3.4 Graphical representation of the only term contributing to $\tau(X^2)$. Note that the indices of
two terms are already equal prior to pairing.


Figure 3.5 Zoom into the smallest arc of a non-crossing partition. The two middle matrices are paired
while the other two could be paired together or to other matrices to the left and right respectively.
After the pairing of $X_{i_{\ell+1} i_{\ell+2}}$ and $X_{i_{\ell+2} i_{\ell+3}}$ we have $i_{\ell+1} = i_{\ell+3}$ and the index $i_{\ell+2}$ is free.


We can use this argument to prove by induction that each non-crossing partition contributes
a factor $\sigma^{2k}$. In Figure 3.5, consecutive elements $X_{i_{\ell+1} i_{\ell+2}}$ and $X_{i_{\ell+2} i_{\ell+3}}$ are
paired; we want to evaluate that pair and remove it from the diagram. The variance contributes
a factor $\sigma^2/N$. We can make two choices for index matching. First consider
$i_{\ell+1} = i_{\ell+3}$ and $i_{\ell+2} = i_{\ell+2}$. In that case, the index $i_{\ell+2}$ is free and its summation
contributes a factor of N. The identity $i_{\ell+1} = i_{\ell+3}$ means that the previous matrix $X_{i_\ell i_{\ell+1}}$
is now linked by matrix multiplication to the following matrix $X_{i_{\ell+1} i_{\ell+4}}$. In other words
we are left with $\sigma^2$ times a non-crossing partition of size 2k − 2, which contributes $\sigma^{2k-2}$
by our induction hypothesis. The other choice of index matching, $i_{\ell+1} = i_{\ell+2} = i_{\ell+3}$,
can be viewed as fixing a particular value for $i_{\ell+2}$ and is included in the sum over $i_{\ell+2}$
in the previous index matching. So by induction we do have that each non-crossing pair
partition contributes $\sigma^{2k}$.

Before we discuss the contribution of crossing pair partitions, let’s analyze in terms
of powers of N the computation we just did for the non-crossing case. The computation
of each term in $\tau(X^{2k})$ involves 2k matrices that have in total 4k indices. The trace and
the matrix multiplication force 2k equalities among these indices. The normalization of
the trace and the k variance terms give a factor of $\sigma^{2k}/N^{k+1}$. To get a result of order
1 we need to be left with k + 1 free indices whose summation gives a factor of $N^{k+1}$.
Each of the k pairings imposes a matching between pairs of indices. For the first k − 1 choices of
pairing we managed to match one pair of indices that were already equal. At the last step
we matched two pairs of indices that were already equal. Hence in total we added only k − 1
equality constraints, which left us with k + 1 free indices as needed.

We can now argue that crossing pair partitions do not contribute in the large N limit.
For a crossing partition it is not possible to choose a matching at every step that matches
a pair of indices that are already equal. If we use the previous algorithm of removing at
each step the leftmost smallest arc, at some point, the smallest arc will have a crossing
and we will be pairing two matrices that share no indices, adding two equality constraints
at this step. The result will therefore be down by at least a factor of 1/N with respect to
the non-crossing case. This argument is not really a proof but an intuition for why this might
be true.³

We can now complete our moments computation. Let

$$C_k := \#\ \text{of non-crossing pairings of } 2k \text{ elements}. \tag{3.25}$$

Since every non-crossing pair partition contributes a factor $\sigma^{2k}$, summing over all
non-crossing pairings we immediately get that

$$\tau(X^{2k}) = C_k \sigma^{2k}. \tag{3.26}$$


3 A more rigorous proof can be found in e.g. Anderson et al. [2010], Tao [2012] or Mingo and Speicher [2017]. In this last
reference, the authors compute the moments of X exactly for every N (when $\sigma_{\mathrm{d}}^2 = \sigma_{\mathrm{od}}^2$).

Figure 3.6 In a non-crossing pairing, the pairing of site 1 with site 2j splits the graph into two disjoint
non-crossing pairings.


3.2.3 Recursion Relation for Catalan Numbers 


In order to compute the Catalan numbers $C_k$, we will write a recursion relation for
them. Take a non-crossing pairing: site 1 is linked to some even site 2j (it is easy to see
that 1 cannot link to an odd site in order for the partition to be non-crossing). Then the
diagram is split into two smaller non-crossing pairings of sizes 2(j − 1) and 2(k − j),
respectively (see Fig. 3.6). Thus we get the inductive relation⁴

$$C_k = \sum_{j=1}^{k} C_{j-1} C_{k-j} = \sum_{j=0}^{k-1} C_j C_{k-j-1}, \tag{3.27}$$

where we let $C_0 = C_1 = 1$. One can then prove by induction that $C_k$ is given by the
Catalan number:

$$C_k = \frac{1}{k+1} \binom{2k}{k}. \tag{3.28}$$


Using the Taylor series for the Stieltjes transform (2.22), we can use the Catalan number
recursion relation to find an equation for the Stieltjes transform of the Wigner ensemble:

$$g(z) = \sum_{k=0}^{\infty} \frac{C_k \sigma^{2k}}{z^{2k+1}}. \tag{3.29}$$


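Thus, using (3.27), we obtain that

$$\begin{aligned}
g(z) - \frac{1}{z} &= \sum_{k=1}^{\infty} \frac{\sigma^{2k}}{z^{2k+1}} \sum_{j=0}^{k-1} C_j C_{k-j-1} \\
&= \frac{\sigma^2}{z} \sum_{j=0}^{\infty} \frac{C_j \sigma^{2j}}{z^{2j+1}} \sum_{k=j+1}^{\infty} \frac{C_{k-j-1}\, \sigma^{2(k-j-1)}}{z^{2(k-j-1)+1}} \\
&= \frac{\sigma^2}{z} \left( \sum_{j=0}^{\infty} \frac{C_j \sigma^{2j}}{z^{2j+1}} \right) \left( \sum_{\ell=0}^{\infty} \frac{C_\ell \sigma^{2\ell}}{z^{2\ell+1}} \right) = \frac{\sigma^2}{z}\, g^2(z),
\end{aligned} \tag{3.30}$$

which gives the same self-consistent equation for g(z) as in (2.35) and hence the same
solution:

$$g(z) = \frac{z - \sqrt{z^2 - 4\sigma^2}}{2\sigma^2}, \tag{3.31}$$

with the branch of the square root chosen such that $g(z) \sim 1/z$ for large |z|.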
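
The agreement between the moment expansion and the closed-form Stieltjes transform can be checked at a real point outside the spectrum. This is a minimal sketch in plain Python; the function names, the truncation order `kmax` and the test point z = 3 (with σ = 1) are our own choices, and the closed form is written here only for real z > 2σ.

```python
from math import comb, sqrt

def g_series(z, sigma, kmax=100):
    """Truncated moment expansion sum_k C_k sigma^(2k) / z^(2k+1)."""
    return sum(comb(2 * k, k) // (k + 1) * sigma**(2 * k) / z**(2 * k + 1)
               for k in range(kmax + 1))

def g_closed(z, sigma):
    """Closed form (z - sqrt(z^2 - 4 sigma^2)) / (2 sigma^2) for real z > 2 sigma."""
    return (z - sqrt(z * z - 4 * sigma**2)) / (2 * sigma**2)

z, sigma = 3.0, 1.0
print(g_series(z, sigma), g_closed(z, sigma))   # both close to 0.381966
```

The series converges geometrically for |z| > 2σ, with ratio (2σ/z)² between successive terms up to slowly varying factors, so a hundred terms are far more than needed here.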


4 Interestingly, this recursion relation is also found in the problem of RNA folding. For deep connections between the physics of 
RNA and RMT, see Orland and Zee [2002]. 

The same result could have been derived by substituting the explicit solution for the
Catalan number Eq. (3.28) into (3.29), but this route requires knowledge of the Taylor
series:

$$\sqrt{1 - x} = 1 - \sum_{k=0}^{\infty} \frac{2}{k+1} \binom{2k}{k} \left(\frac{x}{4}\right)^{k+1}. \tag{3.32}$$


Exercise 3.2.1 Non-crossing pair partitions of eight elements 


(a) Draw all the non-crossing pair partitions of eight elements. Hint: use the 
recursion expressed in Figure 3.6. 
(b) If X is a unit Wigner matrix, what is $\tau(X^8)$?


Wishart Ensemble and Marčenko–Pastur Distribution


In this chapter we will study the statistical properties of large sample covariance matrices of
some N-dimensional variables observed T times. More precisely, the empirical set consists
of N × T data $\{x_i^t\}_{1 \le i \le N,\, 1 \le t \le T}$, where we have T observations and each observation
contains N variables. Examples abound: we could consider the daily returns of N stocks
over a certain time period, or the number of spikes fired by N neurons during T consecutive
time intervals of length $\Delta t$, etc. Throughout this book, we will use the notation q for the
ratio N/T. When the number of observations is much larger than the number of variables,
one has $q \ll 1$. If the number of observations is smaller than the number of variables (a
case that can easily happen in practice), then q > 1.

In the case where $q \to 0$, one can faithfully reconstruct the “true” (or population)
covariance matrix C of the N variables from empirical data. For q = O(1), on the other
hand, the empirical (or sample) covariance matrix is a strongly distorted version of C, even
in the limit of a large number of observations. This is not surprising since we are trying
to estimate $O(N^2/2)$ matrix elements from O(NT) observations. In this chapter, we will
derive the well-known Marčenko–Pastur law for the eigenvalues of the sample covariance
matrix for arbitrary values of q, in the “white” case where the population covariance matrix
C is the identity matrix C = 1.


4.1 Wishart Matrices 
4.1.1 Sample Covariance Matrices 


We assume that the observed variables $\{x_i^t\}$ have zero mean. (Otherwise, we need to remove
the sample mean $T^{-1} \sum_t x_i^t$ from $x_i^t$ for each i. For simplicity, we will not consider this
case.) Then the sample covariances of the data are given by

$$E_{ij} = \frac{1}{T} \sum_{t=1}^{T} x_i^t x_j^t. \tag{4.1}$$
t=1 


Thus the $E_{ij}$ form an N × N matrix E, called the sample covariance matrix (SCM), which we
write in a compact form as

$$E = \frac{1}{T} H H^T, \tag{4.2}$$

where H is an N × T data matrix with entries $H_{it} = x_i^t$.
The matrix E is symmetric and positive semi-definite:

$$E = E^T, \quad \text{and} \quad v^T E v = (1/T)\, \|H^T v\|^2 \ge 0, \tag{4.3}$$

for any $v \in \mathbb{R}^N$. Thus E is diagonalizable and has all eigenvalues $\lambda_k^E \ge 0$.
We can define another covariance matrix by transposing the data matrix H:

$$F = \frac{1}{N} H^T H. \tag{4.4}$$

The matrix F is a T × T matrix; it is also symmetric and positive semi-definite. If the index
i (1 ≤ i ≤ N) labels the variables and the index t (1 ≤ t ≤ T) the observations, we can
call the matrix F the covariance of the observations (as opposed to E the covariance of the
variables). $F_{ts}$ measures how similar the observations at t are to those at s; in the above
example of neurons, it would measure how similar the firing pattern at time t is to that at
time s.

As we saw in Section 1.1.3, the matrices TE and NF have the same non-zero eigenval- 
ues. Also the matrix E has at least N — T zero eigenvalues if N > T (and F has at least 
T — N zero eigenvalues if T > N). 

Assume for a moment that N < T (i.e. q < 1), then we know that F has N (zero or
non-zero) eigenvalues inherited from E and equal to $q^{-1} \lambda_k^E$, and T − N zero eigenvalues.
This allows us to write an exact relation between the Stieltjes transforms of E and F:

$$\begin{aligned}
g_T^F(z) &= \frac{1}{T} \sum_{k=1}^{T} \frac{1}{z - \lambda_k^F} \\
&= \frac{1}{T} \sum_{k=1}^{N} \frac{1}{z - q^{-1}\lambda_k^E} + \frac{1}{T}\,(T - N)\, \frac{1}{z - 0} \\
&= q^2 g_N^E(qz) + \frac{1-q}{z}.
\end{aligned} \tag{4.5}$$

A similar argument with T < N leads to the same Eq. (4.5) so it is actually valid for any
value of q. The relationship should be true as well in the large N limit:

$$g_F(z) = q^2 g_E(qz) + \frac{1-q}{z}. \tag{4.6}$$
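
Since the finite-size relation between the two Stieltjes transforms is exact, it can be tested on any data matrix. The sketch below assumes NumPy; the sizes, the seed, the evaluation point z and the helper name `stieltjes` are our own choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, t = 40, 100
q = n / t
h = rng.normal(size=(n, t))
e = h @ h.T / t                  # N x N sample covariance matrix E
f = h.T @ h / n                  # T x T covariance of the observations F

def stieltjes(mat, z):
    """Normalized trace of the resolvent, (1/dim) Tr (z - mat)^(-1)."""
    eigs = np.linalg.eigvalsh(mat)
    return np.mean(1.0 / (z - eigs))

z = 2.0 + 0.5j
lhs = stieltjes(f, z)
rhs = q**2 * stieltjes(e, q * z) + (1 - q) / z
print(lhs, rhs)                  # equal up to rounding errors
```

Swapping the roles of n and t (so that q > 1) leaves the identity intact, in line with the remark that the relation holds for any q.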


4.1.2 First and Second Moments of a Wishart Matrix 


We now study the SCM E. Assume that the column vectors of H are drawn independently
from a multivariate Gaussian distribution with mean zero and “true” (or “population”)
covariance matrix C, i.e.

$$\mathbb{E}[H_{it} H_{js}] = C_{ij} \delta_{ts}, \tag{4.7}$$

with, again,

$$E = \frac{1}{T} H H^T. \tag{4.8}$$


Sample covariance matrices of this type were first studied by the Scottish mathematician 
John Wishart (1898–1956) and are now called Wishart matrices.

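Recall that if $(X_1, \ldots, X_{2n})$ is a zero-mean multivariate normal random vector, then by
Wick’s theorem,

$$\mathbb{E}[X_1 X_2 \cdots X_{2n}] = \sum_{\text{pairings}} \prod_{\text{pairs}} \mathbb{E}[X_i X_j] = \sum_{\text{pairings}} \prod_{\text{pairs}} \operatorname{Cov}(X_i, X_j), \tag{4.9}$$

where $\sum_{\text{pairings}} \prod_{\text{pairs}}$ means that we sum over all distinct pairings of $\{X_1, \ldots, X_{2n}\}$ and
each summand is the product of the n pairs.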
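
The sum over pairings can be written as a short recursion: pair the first variable with each of the others in turn and recurse on what is left. The sketch below assumes NumPy only for holding the covariance matrix; the function name `wick_moment` and the example covariance are our own.

```python
import numpy as np

def wick_moment(cov, idx):
    """E[X_{idx[0]} X_{idx[1]} ...] for a zero-mean Gaussian vector with
    covariance matrix `cov`, computed by summing over all pairings."""
    if len(idx) == 0:
        return 1.0
    if len(idx) % 2 == 1:
        return 0.0                       # odd moments vanish
    first, rest = idx[0], idx[1:]
    total = 0.0
    for p, partner in enumerate(rest):
        remaining = rest[:p] + rest[p + 1:]
        total += cov[first][partner] * wick_moment(cov, remaining)
    return total

cov = np.array([[2.0, 0.5], [0.5, 1.0]])
print(wick_moment(cov, [0, 0, 0, 0]))    # 3 * 2^2 = 12.0
print(wick_moment(cov, [0, 0, 1, 1]))    # C00*C11 + 2*C01^2 = 2.5
```

The number of terms grows as (2n − 1)!!, so this brute-force form is only meant for small n.
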
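First taking expectation, we obtain that

$$\mathbb{E}[E_{ij}] = \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} H_{it} H_{jt}\right] = \frac{1}{T} \sum_{t=1}^{T} C_{ij} = C_{ij}. \tag{4.10}$$

Thus, we have $\mathbb{E}[E] = C$: as is well known, the SCM is an unbiased estimator of the true
covariance matrix (at least when $\mathbb{E}[x_i^t] = 0$).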

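For the fluctuations, we need to study the higher order moments of E. The second
moment can be calculated as

$$\tau(E^2) := \frac{1}{N T^2}\, \mathbb{E}\left[\operatorname{Tr}(H H^T H H^T)\right] = \frac{1}{N T^2} \sum_{i,j,t,s} \mathbb{E}\left[H_{it} H_{jt} H_{js} H_{is}\right]. \tag{4.11}$$

Then by Wick’s theorem, we have (see Fig. 4.1)

$$\begin{aligned}
\tau(E^2) &= \frac{1}{N T^2} \sum_{t,s} \sum_{i,j} C_{ij}^2 + \frac{1}{N T^2} \sum_{t,s} \sum_{i,j} \delta_{ts} C_{ii} C_{jj} + \frac{1}{N T^2} \sum_{t,s} \sum_{i,j} \delta_{ts} C_{ij}^2 \\
&= \tau(C^2) + \frac{N}{T}\, \tau(C)^2 + \frac{1}{T}\, \tau(C^2).
\end{aligned} \tag{4.12}$$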


Suppose $N, T \to \infty$ with some fixed ratio N/T = q for some constant q > 0. The last
term on the right hand side then tends to zero and we get

$$\tau(E^2) \to \tau(C^2) + q\, \tau(C)^2. \tag{4.13}$$


Figure 4.1 Graphical representation of the three Wick’s contractions corresponding to the three terms
in Eq. (4.12).


The variance of the SCM is greater than that of the true covariance by a term proportional to
q. When $q \to 0$ we recover perfect estimation and the two matrices have the same variance.
If $C = \alpha \mathbf{1}$ (a multiple of the identity) then $\tau(C^2) - \tau(C)^2 = 0$ but $\tau(E^2) - \tau(E)^2 \to q\alpha^2$.
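
The large-N formula for the second moment can be compared with a simulated sample covariance matrix. The sketch below assumes NumPy and, purely for illustration, a diagonal population covariance with eigenvalues spread between 0.5 and 1.5; the sizes, the seed and the helper name `tau` are our own choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, t = 200, 400
q = n / t
c_diag = np.linspace(0.5, 1.5, n)                 # eigenvalues of a diagonal "true" covariance C
h = np.sqrt(c_diag)[:, None] * rng.normal(size=(n, t))   # columns distributed as N(0, C)
e = h @ h.T / t                                   # sample covariance matrix

def tau(m):
    """Normalized trace (1/N) Tr m."""
    return np.trace(m) / m.shape[0]

lhs = tau(e @ e)
rhs = np.mean(c_diag**2) + q * np.mean(c_diag)**2
print(lhs, rhs)                                   # agree up to O(1/N) corrections
```

Even though C is known exactly here, the second moment of E is inflated by the term proportional to q.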


4.1.3 The Law of Wishart Matrices 


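Next, we give the joint distribution of elements of E. For each fixed column of H, the joint
distribution of the elements is

$$P\left(\{H_{it}\}_{i=1}^{N}\right) = \frac{1}{\sqrt{(2\pi)^N \det C}} \exp\left[-\frac{1}{2} \sum_{i,j} H_{it}\, (C^{-1})_{ij}\, H_{jt}\right]. \tag{4.14}$$

Taking the product over 1 ≤ t ≤ T (since the columns are independent), we obtain

$$\begin{aligned}
P(H) &= \frac{1}{(2\pi)^{NT/2} (\det C)^{T/2}} \exp\left[-\frac{1}{2} \operatorname{Tr}\left(H^T C^{-1} H\right)\right] \\
&= \frac{1}{(2\pi)^{NT/2} (\det C)^{T/2}} \exp\left[-\frac{T}{2} \operatorname{Tr}\left(E C^{-1}\right)\right].
\end{aligned} \tag{4.15}$$

Let us now make a change of variables H → E. As shown in the technical paragraph 4.1.4,
the Jacobian of the transformation is proportional to $(\det E)^{\frac{T-N-1}{2}}$. The following exact
expression for the law of the matrix elements was obtained by Wishart:¹

$$P(E) = \frac{(T/2)^{NT/2}}{\Gamma_N(T/2)}\, \frac{(\det E)^{(T-N-1)/2}}{(\det C)^{T/2}} \exp\left[-\frac{T}{2} \operatorname{Tr}\left(E C^{-1}\right)\right], \tag{4.16}$$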


where $\Gamma_N$ is the multivariate gamma function. Note that the density is restricted to positive
semi-definite matrices E. The Wishart distribution can be thought of as the matrix generalization
of the gamma distribution. Indeed for N = 1, P(E) reduces to such a distribution:

$$P_\gamma(x) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}, \tag{4.17}$$

where b = T/(2C) and a = T/2. Using the identity det E = exp(Tr log E), we can rewrite
the expression (4.16) as²

$$P(E) = \frac{(T/2)^{NT/2}}{\Gamma_N(T/2)}\, \frac{1}{(\det C)^{T/2}} \exp\left[-\frac{T}{2} \operatorname{Tr}\left(E C^{-1}\right) + \frac{T-N-1}{2} \operatorname{Tr} \log E\right]. \tag{4.19}$$


1 Note that the Wishart distribution is often given with the normalization E[E] = TC as opposed to E[E] = C used here.
2 Complex and quaternionic Hermitian white Wishart matrices have a similar law of the elements with a factor of $\beta$ in the
exponential:

$$P(W) \propto \exp\left[-\frac{\beta N}{2} \operatorname{Tr} V(W)\right] \quad \text{with} \quad V(x) = \frac{N - T - 1 + 2/\beta}{N} \log x + \frac{T}{N}\, x, \tag{4.18}$$

with $\beta$ equal to 1, 2 or 4 as usual. The large N limit of V(x) is the same in all three cases and is given by Eq. (4.21).


We will denote by W a SCM with C = 1 and call such a matrix a white Wishart matrix.
In this case, as $N, T \to \infty$ with q := N/T, we get that

$$P(W) \propto \exp\left[-\frac{N}{2} \operatorname{Tr} V(W)\right], \tag{4.20}$$

where

$$V(W) := (1 - q^{-1}) \log W + q^{-1} W. \tag{4.21}$$

Note that the above P(W) is rotationally invariant in the white case. In fact, if a vector
v has Gaussian distribution $\mathcal{N}(0, \mathbf{1}_{N \times N})$, then Ov has the same distribution $\mathcal{N}(0, \mathbf{1}_{N \times N})$
for any orthogonal matrix O. Hence OH has the same distribution as H, which shows that
$O E O^T$ has the same distribution as E.


4.1.4 Jacobian of the Transformation H → E


The aim here is to compute the volume $\Upsilon(E)$ corresponding to all H’s such that
$E = T^{-1} H H^T$:

$$\Upsilon(E) = \int \mathrm{d}H\, \delta(E - T^{-1} H H^T). \tag{4.22}$$

Note that this volume is the inverse of the Jacobian of the transformation H → E. Next
note that one can choose E to be diagonal, because one can always rotate the integral over
H to an integral over OH, where O is the rotation matrix that makes E diagonal. Now,
introducing the Fourier representation of the $\delta$ function for all N(N + 1)/2 independent
components of E, one has

$$\Upsilon(E) = \int \mathrm{d}H\, \mathrm{d}A\, \exp\left(i \operatorname{Tr}(A E - T^{-1} A H H^T)\right), \tag{4.23}$$

where A is the symmetric matrix of the corresponding Fourier variables, to which we add a
small imaginary part proportional to 1 to make all the following integrals well defined. The
Gaussian integral over H can now be performed explicitly for all t = 1, ..., T, leading to

$$\int \mathrm{d}H\, \exp\left(-i T^{-1} \operatorname{Tr}(A H H^T)\right) \propto (\det A)^{-T/2}, \tag{4.24}$$

leaving us with

$$\Upsilon(E) \propto \int \mathrm{d}A\, \exp\left(i \operatorname{Tr}(A E)\right) (\det A)^{-T/2}. \tag{4.25}$$


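We can change variables from A to $\mathbf{B} = \mathbf{E}^{1/2}\mathbf{A}\mathbf{E}^{1/2}$. The Jacobian of this transformation is

$$\begin{aligned} \prod_i \mathrm{d}A_{ii} \prod_{j>i} \mathrm{d}A_{ij} &= \prod_i E_{ii}^{-1} \prod_{j>i} (E_{ii}E_{jj})^{-1/2} \prod_i \mathrm{d}B_{ii} \prod_{j>i} \mathrm{d}B_{ij} \\ &= (\det \mathbf{E})^{-(N+1)/2} \prod_i \mathrm{d}B_{ii} \prod_{j>i} \mathrm{d}B_{ij}. \end{aligned} \tag{4.26}$$

So finally,

$$\Upsilon(\mathbf{E}) \propto \left[\int \mathrm{d}\mathbf{B}\; \exp\left(i\, \mathrm{Tr}\, \mathbf{B}\right) (\det \mathbf{B})^{-T/2}\right] (\det \mathbf{E})^{(T-N-1)/2}, \tag{4.27}$$

as announced in the main text.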


4.2 Marčenko–Pastur Using the Cavity Method
4.2.1 Self-Consistent Equation for the Resolvent


We first derive the asymptotic distribution of eigenvalues of the Wishart matrix with C = 1,
i.e. the Marčenko–Pastur distribution. We will use the same method as in the derivation of
the Wigner semi-circle law in Section 2.3. In the case C = 1, the N × T matrix H is filled
with IID standard Gaussian random numbers and we have $\mathbf{W} = (1/T)\mathbf{H}\mathbf{H}^\top$.

As in Section 2.3, we wish to derive a self-consistent equation satisfied by the Stieltjes
transform:

$$g_{\mathbf{W}}(z) = \tau\left(\mathbf{G}_{\mathbf{W}}(z)\right), \qquad \mathbf{G}_{\mathbf{W}}(z) := (z\mathbf{1} - \mathbf{W})^{-1}. \tag{4.28}$$


We fix a large N and first write an equation for the element 11 of $\mathbf{G}_{\mathbf{W}}(z)$. We will argue
later that $G_{11}(z)$ converges to g(z) with negligible fluctuations. (We henceforth drop the
subscript W as this entire section deals with the white Wishart case.)

Using again the Schur complement formula (1.32), we have that

$$\frac{1}{(\mathbf{G}(z))_{11}} = M_{11} - \mathbf{M}_{12} (\mathbf{M}_{22})^{-1} \mathbf{M}_{21}, \tag{4.29}$$

where $\mathbf{M} := z\mathbf{1} - \mathbf{W}$, and the submatrices are of size, respectively, $[M_{11}] = 1 \times 1$, $[\mathbf{M}_{12}] = 1 \times (N-1)$, $[\mathbf{M}_{21}] = (N-1) \times 1$, $[\mathbf{M}_{22}] = (N-1) \times (N-1)$. We can expand the above
expression and write

$$\frac{1}{(\mathbf{G}(z))_{11}} = z - W_{11} - \frac{1}{T^2} \sum_{t,s=1}^{T} \sum_{j,k=2}^{N} H_{1t} H_{jt}\, (\mathbf{M}_{22})^{-1}_{jk}\, H_{ks} H_{1s}. \tag{4.30}$$


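Note that the three matrices $\mathbf{M}_{22}$, $H_{jt}$ (j ≥ 2) and $H_{ks}$ (k ≥ 2) are independent of the
entries $H_{1t}$ for all t. We can write the last term on the right hand side as

$$\frac{1}{T} \sum_{t,s=1}^{T} H_{1t}\, \Omega_{ts}\, H_{1s} \quad\text{with}\quad \Omega_{ts} = \frac{1}{T} \sum_{j,k=2}^{N} H_{jt}\, (\mathbf{M}_{22})^{-1}_{jk}\, H_{ks}. \tag{4.31}$$

Provided $\gamma^2 := T^{-1}\, \mathrm{Tr}\, \mathbf{\Omega}^2$ converges to a finite limit when T → ∞,³ one readily shows
that the above sum converges to $T^{-1}\, \mathrm{Tr}\, \mathbf{\Omega}$ with fluctuations of the order of $\gamma T^{-1/2}$. So we
have, for large T,

$$\begin{aligned} \frac{1}{(\mathbf{G}(z))_{11}} &= z - W_{11} - \frac{1}{T} \sum_{2 \le j,k \le N} \left(\frac{1}{T} \sum_{t=1}^{T} H_{kt} H_{jt}\right) (\mathbf{M}_{22})^{-1}_{jk} + O\left(T^{-1/2}\right) \\ &= z - W_{11} - \frac{1}{T} \sum_{2 \le j,k \le N} W_{kj}\, (\mathbf{M}_{22})^{-1}_{jk} + O\left(T^{-1/2}\right) \\ &= z - 1 - \frac{1}{T}\, \mathrm{Tr}\left(\mathbf{W}_2 \mathbf{G}_2(z)\right) + O\left(T^{-1/2}\right), \end{aligned} \tag{4.32}$$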


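where in the last step we have used the fact that $W_{11} = 1 + O(T^{-1/2})$ and denoted by $\mathbf{W}_2$ and
$\mathbf{G}_2(z)$ the SCM and resolvent of the N − 1 variables excluding (1). We can rewrite the trace
term:

$$\begin{aligned} \mathrm{Tr}\left(\mathbf{W}_2 \mathbf{G}_2(z)\right) &= \mathrm{Tr}\left(\mathbf{W}_2 (z\mathbf{1} - \mathbf{W}_2)^{-1}\right) \\ &= -\mathrm{Tr}\, \mathbf{1} + z\, \mathrm{Tr}\left((z\mathbf{1} - \mathbf{W}_2)^{-1}\right) \\ &= -\mathrm{Tr}\, \mathbf{1} + z\, \mathrm{Tr}\, \mathbf{G}_2(z). \end{aligned} \tag{4.33}$$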


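In the region where Tr G(z)/N converges for large N to the deterministic g(z), $\mathrm{Tr}\, \mathbf{G}_2(z)/N$
should also converge to the same limit, as $\mathbf{G}_2(z)$ is just an (N − 1) × (N − 1) version of
G(z). So in the region of convergence we have

$$\frac{1}{(\mathbf{G}(z))_{11}} = z - 1 + q - q z g(z) + O\left(N^{-1/2}\right), \tag{4.34}$$

where we have introduced q = N/T = O(1), such that $N^{-1/2}$ and $T^{-1/2}$ are of the same
order of magnitude. This last equation states that $1/(\mathbf{G}(z))_{11}$ has negligible fluctuations and
can safely be replaced by its expectation value, i.e.

$$\frac{1}{(\mathbf{G}(z))_{11}} = \mathbb{E}\left[\frac{1}{(\mathbf{G}(z))_{11}}\right] + O\left(N^{-1/2}\right) = \frac{1}{\mathbb{E}\left[(\mathbf{G}(z))_{11}\right]} + O\left(N^{-1/2}\right). \tag{4.35}$$

By rotational invariance of W, we have

$$\mathbb{E}\left[(\mathbf{G}(z))_{11}\right] = \frac{1}{N}\, \mathbb{E}\left[\mathrm{Tr}(\mathbf{G}(z))\right] \to g(z). \tag{4.36}$$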


3 It can be self-consistently checked from the solution below that $\lim_{T \to \infty} \gamma^2 = -q\, g'_{\mathbf{W}}(z)$.

In the large N limit we obtain the following self-consistent equation for g(z):

$$\frac{1}{g(z)} = z - 1 + q - q z g(z). \tag{4.37}$$
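The cavity argument can be checked numerically. The following is a minimal sketch, not part of the derivation: the sizes N and T, the random seed and the test point z are arbitrary choices made for illustration. It builds one white Wishart matrix, computes its resolvent, and compares the average of $1/(\mathbf{G}(z))_{ii}$ over the diagonal with the right-hand side of Eq. (4.37), using the normalized trace of the resolvent as the finite-N estimate of g(z).

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, for reproducibility
N, T = 200, 400                           # illustrative sizes, q = N/T = 1/2
q = N / T

H = rng.standard_normal((N, T))           # IID standard Gaussian entries
W = H @ H.T / T                           # white Wishart matrix

z = 1.0 + 1.0j                            # test point away from the real axis
G = np.linalg.inv(z * np.eye(N) - W)      # resolvent G(z) = (z 1 - W)^{-1}
g = np.trace(G) / N                       # finite-N estimate of g(z)

lhs = np.mean(1.0 / np.diag(G))           # average of 1/G_ii over all i
rhs = z - 1 + q - q * z * g               # right-hand side of Eq. (4.37)
residual = abs(lhs - rhs)
print(lhs, rhs, residual)
```

Averaging over the diagonal (rather than using the single element 11) reduces the $O(N^{-1/2})$ fluctuations discussed above, so the two sides should agree up to small finite-size corrections.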


4.2.2 Solution and Density of Eigenvalues 
Solving (4.37) we obtain

$$g(z) = \frac{z + q - 1 \pm \sqrt{(z + q - 1)^2 - 4qz}}{2qz}. \tag{4.38}$$

The argument of the square-root is quadratic in z and its roots (the edges of the spectrum) are
given by

$$\lambda_\pm = (1 \pm \sqrt{q})^2. \tag{4.39}$$


Finding the correct branch is quite subtle; this will be the subject of Section 4.2.3. We will
see that the form

$$g(z) = \frac{z - (1 - q) - \sqrt{z - \lambda_+}\, \sqrt{z - \lambda_-}}{2qz} \tag{4.40}$$

has all the correct analytical properties. Note that for z = x − iη with x ≠ 0 and η → 0, g(z)
can only have an imaginary part if $\sqrt{(x - \lambda_+)(x - \lambda_-)}$ is imaginary. Then using (2.47),
we get the famous Marčenko–Pastur distribution for the bulk:

$$\rho(x) = \frac{1}{\pi} \lim_{\eta \to 0^+} \mathrm{Im}\, g(x - i\eta) = \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{2\pi q x}, \qquad \lambda_- < x < \lambda_+. \tag{4.41}$$


Moreover, by studying the behavior of Eq. (4.40) near z = 0 one sees that there is a pole
at 0 when q > 1. This gives a delta mass as z → 0:

$$\frac{q - 1}{q}\, \delta(x), \tag{4.42}$$

which corresponds to the N − T trivial zero eigenvalues of E in the N > T case. Combining
the above discussions, the full Marčenko–Pastur law can be written as

$$\rho_{\mathrm{MP}}(x) = \frac{\sqrt{\left[(\lambda_+ - x)(x - \lambda_-)\right]_+}}{2\pi q x} + \frac{q - 1}{q}\, \delta(x)\, \Theta(q - 1), \tag{4.43}$$


where we denote $[a]_+ := \max\{a, 0\}$ for any a ∈ ℝ, and

$$\Theta(q - 1) := \begin{cases} 0 & \text{if } q \le 1, \\ 1 & \text{if } q > 1. \end{cases} \tag{4.44}$$

Figure 4.2 Marčenko–Pastur distribution: density of eigenvalues ρ(λ) for a Wishart matrix for q = 1/2
and q = 2. Note that for q = 2 there is a Dirac mass at zero ($\tfrac{1}{2}\delta(\lambda)$). Also note that the two bulk
densities are the same up to a rescaling and normalization: $\rho_{1/q}(\lambda) = q^2 \rho_q(q\lambda)$.

Note that the Stieltjes transforms (Eq. (4.40)) for q and 1/q are related by Eq. (4.5).
As a consequence the bulk densities for q and 1/q are the same when properly rescaled
(see Fig. 4.2):

$$\rho_{1/q}(x) = q^2 \rho_q(qx). \tag{4.45}$$
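The bulk density, its normalization and the duality (4.45) are easy to check numerically. The sketch below is only an illustration: the function names `mp_edges`, `mp_bulk_density`, `trapezoid` and `bulk_mass` are hypothetical helpers introduced here, and the grid size is an arbitrary choice. It integrates the continuous part of Eq. (4.43) with a plain trapezoidal rule, which should give 1 for q < 1 and 1/q for q > 1 (the remaining mass (q − 1)/q sits in the Dirac peak at zero).

```python
import numpy as np

def mp_edges(q):
    """Edges lambda_-, lambda_+ = (1 -/+ sqrt(q))^2 of the bulk, Eq. (4.39)."""
    return (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

def mp_bulk_density(x, q):
    """Continuous part of Eq. (4.43); the delta mass at zero is not included."""
    lam_minus, lam_plus = mp_edges(q)
    x = np.asarray(x, dtype=float)
    inside = np.clip((lam_plus - x) * (x - lam_minus), 0.0, None)  # [.]_+
    return np.sqrt(inside) / (2 * np.pi * q * x)

def trapezoid(y, x):
    """Explicit trapezoidal rule (avoids NumPy version differences)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def bulk_mass(q, n=200_001):
    """Total mass carried by the bulk, integrated between the two edges."""
    lam_minus, lam_plus = mp_edges(q)
    x = np.linspace(lam_minus, lam_plus, n)
    return trapezoid(mp_bulk_density(x, q), x)

mass_half = bulk_mass(0.5)   # q < 1: the bulk carries all the mass
mass_two = bulk_mass(2.0)    # q > 1: the bulk carries 1/q of the mass
print(mass_half, mass_two)

# Duality (4.45), checked at one point lying inside both bulks
q, x0 = 2.0, 2.0
lhs = mp_bulk_density(x0, 1 / q)
rhs = q**2 * mp_bulk_density(q * x0, q)
print(lhs, rhs)
```

The case q = 1 is deliberately left out of this sketch, since the density then has an integrable singularity at x = 0 that a uniform grid handles poorly.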


Exercise 4.2.1 Properties of the Marčenko–Pastur solution
We saw that the Stieltjes transform of a large Wishart matrix (with q = N/T)
should be given by

$$g(z) = \frac{z + q - 1 \pm \sqrt{(z + q - 1)^2 - 4qz}}{2qz}, \tag{4.46}$$

where the sign of the square-root should be chosen such that g(z) → 1/z when
z → ±∞.

(a) Show that the zeros of the argument of the square-root are given by $\lambda_\pm = (1 \pm \sqrt{q})^2$.

(b) The function

$$g(z) = \frac{z + q - 1 - \sqrt{z - \lambda_-}\, \sqrt{z - \lambda_+}}{2qz} \tag{4.47}$$

should have the right properties. Show that it behaves as g(z) → 1/z when
z → ±∞. By expanding in powers of 1/z up to $1/z^3$ compute the first and
second moments of the Wishart distribution.

(c) Show that Eq. (4.47) is regular at z = 0 when q < 1. In that case, compute
the first inverse moment of the Wishart matrix $\tau(\mathbf{E}^{-1})$. What happens when
q → 1? Show that Eq. (4.47) has a pole at z = 0 when q > 1 and compute
the value of this pole.

(d) The non-zero eigenvalues should be distributed according to the Marčenko–Pastur
distribution:

$$\rho_q(x) = \frac{\sqrt{(x - \lambda_-)(\lambda_+ - x)}}{2\pi q x}. \tag{4.48}$$

Show that this distribution is correctly normalized when q < 1 but not when
q > 1. Use what you know about the pole at z = 0 in that case to correctly
write down $\rho_q(x)$ when q > 1.

(e) In the case q = 1, Eq. (4.48) has an integrable singularity at x = 0. Write
a simpler formula for $\rho_1(x)$. Let u be the square of an eigenvalue from a
Wigner matrix of unit variance, i.e. $u = y^2$ where y is distributed according
to the semi-circular law $\rho(y) = \sqrt{4 - y^2}/(2\pi)$. Show that u is distributed
according to $\rho_1(x)$. This result is a priori not obvious as a Wigner matrix is
symmetric while the square matrix H is generally not; nevertheless, moments
of high-dimensional matrices of the form $\mathbf{H}\mathbf{H}^\top$ are the same whether the
matrix H is symmetric or not.

(f) Generate three matrices $\mathbf{E} = \mathbf{H}\mathbf{H}^\top/T$ where the matrix H is an N × T
matrix of IID Gaussian numbers of variance 1. Choose a large N and three
values of T such that q = N/T equals {1/2, 1, 2}. Plot a normalized histogram
of the eigenvalues in the three cases vs the corresponding Marčenko–Pastur
distribution; don’t show the peak at zero. In the case q = 2, how many zero
eigenvalues do you expect? How many do you get?


4.2.3 The Correct Root of the Stieltjes Transform 


In our study of random matrices we will often encounter limiting Stieltjes transforms 
that are determined by quadratic or higher order polynomial equations, and the problem 
of choosing the correct solution (or branch) will come up repeatedly. 

Let us go back to the unit Wigner matrix case where we found (see Section 2.3.2)

$$g(z) = \frac{z \pm \sqrt{z^2 - 4}}{2}. \tag{4.49}$$


We want g(z) to behave like 1/z as |z| → ∞ and to be analytic everywhere except on the
real axis in [−2, 2]. The square-root term must thus behave as −z for real z when z → ±∞.
The standard definition of the square-root behaves as $\sqrt{z^2} \sim |z|$ and cannot be made to
have the correct sign on both sides. Another issue with $\sqrt{z^2 - 4}$ is that it has a more
extended branch cut than allowed. We expect the function g(z) to be analytic everywhere
except for real z ∈ [−2, 2]. The branch cut of a standard square-root is the set of points
where its argument is real and negative. In the case of $\sqrt{z^2 - 4}$, this includes the interval
[−2, 2] as expected but also the pure imaginary line z = ix (Fig. 4.3). The finite N Stieltjes
transform is perfectly regular on the imaginary axis so we expect its large N limit to be
regular there as well.

Figure 4.3 The branch cuts of $f(z) = \sqrt{z^2 - 4}$. The vertical branch cut (for z pure imaginary)
should not be present in the Stieltjes transform of the Wigner ensemble. We have $f(\pm 0^+ + ix) \approx \pm i\sqrt{x^2 + 4}$; this branch cut can be eliminated by multiplying f(z) by sign(Re z).

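For the unit Wigner matrix, there are at least three solutions to the branch problem:

$$g_1(z) = \frac{z - z\sqrt{1 - 4/z^2}}{2}, \tag{4.50}$$

$$g_2(z) = \frac{z - \sqrt{z - 2}\, \sqrt{z + 2}}{2}, \tag{4.51}$$

$$g_3(z) = \frac{z - \mathrm{sign}(\mathrm{Re}\, z)\, \sqrt{z^2 - 4}}{2}. \tag{4.52}$$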


All three definitions behave as g(z) ∼ 1/z at infinity. For the second one, we need to define
the square-root of a negative real number. If we define it as $i\sqrt{|z|}$, the two factors of i give
a −1 for real z < −2. The three functions also have the correct branch cuts. For the first
one, one can show that the argument of the square-root can be a negative real number only
if z ∈ (−2, 2); there are no branch cuts elsewhere in the complex plane. For the second
one, there seems to be a branch cut for all real z < 2, but a closer inspection reveals
that around real z < −2 the function has no discontinuity as one goes up and down the
imaginary axis, as the two branch cuts exactly compensate each other. For the third one,
the discontinuous sign function exactly cancels the branch cut on the pure imaginary axis
(Fig. 4.3).

For z with a large positive real part the three functions are clearly the same. Since they
are analytic functions everywhere except on the same branch cuts, they are the same and
unique function g(z).
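The equivalence of the three prescriptions can be verified numerically. The sketch below is an illustration under stated assumptions: it relies on NumPy's `np.sqrt`, which returns the principal branch for complex input, and the sample points are random and kept away from both axes so that none falls on a branch cut or on the line Re z = 0 where `np.sign` vanishes.

```python
import numpy as np

def g1(z):
    """First prescription, Eq. (4.50)."""
    return (z - z * np.sqrt(1 - 4 / z**2)) / 2

def g2(z):
    """Second prescription, Eq. (4.51): product of two principal square roots."""
    return (z - np.sqrt(z - 2) * np.sqrt(z + 2)) / 2

def g3(z):
    """Third prescription, Eq. (4.52): sign(Re z) removes the vertical cut."""
    return (z - np.sign(z.real) * np.sqrt(z**2 - 4)) / 2

rng = np.random.default_rng(1)            # arbitrary seed
n = 1000
# Random complex points in all four quadrants, at distance >= 0.1 from both axes
re = rng.uniform(0.1, 5.0, n) * rng.choice([-1.0, 1.0], n)
im = rng.uniform(0.1, 5.0, n) * rng.choice([-1.0, 1.0], n)
z = re + 1j * im

max_diff = max(np.abs(g1(z) - g2(z)).max(), np.abs(g1(z) - g3(z)).max())
quad_residual = np.abs(g2(z) ** 2 - z * g2(z) + 1).max()   # g solves g^2 - z g + 1 = 0

z_far = 1.0e3 + 1.0j
tail = abs(z_far * g2(z_far) - 1)          # g(z) ~ 1/z at large |z|
print(max_diff, quad_residual, tail)
```

The quadratic residual only confirms that each function is a root of the Wigner equation; it is the agreement of the three functions in all four quadrants, together with the 1/z tail, that singles out the correct branch.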


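For a Wigner matrix shifted by $\lambda_0$ and of variance $\sigma^2$ we can scale and shift the
eigenvalues, with edges now equal to $\lambda_\pm = \lambda_0 \pm 2\sigma$, and find

$$g_1(z) = \frac{z - \lambda_0 - (z - \lambda_0)\sqrt{1 - \dfrac{4\sigma^2}{(z - \lambda_0)^2}}}{2\sigma^2}, \tag{4.53}$$

$$g_2(z) = \frac{z - \lambda_0 - \sqrt{z - \lambda_+}\, \sqrt{z - \lambda_-}}{2\sigma^2}, \tag{4.54}$$

$$g_3(z) = \frac{z - \lambda_0 - \mathrm{sign}(\mathrm{Re}\, z - \lambda_0)\, \sqrt{(z - \lambda_0)^2 - 4\sigma^2}}{2\sigma^2}. \tag{4.55}$$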


The three definitions are still equivalent as they are just the result of a shift and a scaling 
of the same function. 

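For more complicated problems, writing explicitly any one of these three prescriptions
can quickly become very cumbersome (except maybe in cases where $\lambda_+ + \lambda_- = 0$). We
propose here a new notation. When finding the correct square-root of a second degree
polynomial we will write

$$\sqrt[\pm]{az^2 + bz + c} := \sqrt{a}\, \sqrt{z - \lambda_+}\, \sqrt{z - \lambda_-} = \sqrt{a}\, (z - \lambda_0) \sqrt{1 - \frac{\Delta}{(z - \lambda_0)^2}} = \mathrm{sign}(\mathrm{Re}\, z - \lambda_0)\, \sqrt{az^2 + bz + c}, \tag{4.56}$$

for a > 0 and where $\lambda_\pm = \lambda_0 \pm \sqrt{\Delta}$ are the roots of $az^2 + bz + c$, assumed to be real.
While the notation is defined everywhere in the complex plane, it is easily evaluated for
real arguments:

$$\sqrt[\pm]{ax^2 + bx + c} = \begin{cases} -\sqrt{ax^2 + bx + c} & \text{for } x \le \lambda_-, \\ \phantom{-}\sqrt{ax^2 + bx + c} & \text{for } x \ge \lambda_+. \end{cases} \tag{4.57}$$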


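The value on the branch cut is ill-defined but, with the principal square roots of Eq. (4.56) and the convention $\rho(x) = \pi^{-1} \lim_{\eta \to 0^+} \mathrm{Im}\, g(x - i\eta)$ used in Eq. (4.41), we have

$$\lim_{z \to x - i0^+} \sqrt[\pm]{az^2 + bz + c} = -i \sqrt{|ax^2 + bx + c|} \quad \text{for } \lambda_- < x < \lambda_+. \tag{4.58}$$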


With our new notation, we can now safely write for the white Wishart:

$$g(z) = \frac{z + q - 1 - \sqrt[\pm]{(z + q - 1)^2 - 4qz}}{2qz}, \tag{4.59}$$

or, more explicitly, using the second prescription:

$$g(z) = \frac{z + q - 1 - \sqrt{z - \lambda_+}\, \sqrt{z - \lambda_-}}{2qz}, \tag{4.60}$$

where $\lambda_\pm = (1 \pm \sqrt{q})^2$.
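As a sanity check, Eq. (4.60) can be evaluated with principal square roots and compared both with the self-consistent equation (4.37) and with the empirical Stieltjes transform of a simulated white Wishart matrix. This is a minimal sketch: the helper name `g_mp`, the matrix sizes, the seed and the test points are illustrative assumptions, not quantities fixed by the text.

```python
import numpy as np

def g_mp(z, q):
    """Eq. (4.60): second prescription, using principal complex square roots."""
    lam_minus, lam_plus = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
    z = np.asarray(z, dtype=complex)
    root = np.sqrt(z - lam_plus) * np.sqrt(z - lam_minus)
    return (z + q - 1 - root) / (2 * q * z)

q = 0.5
z = np.array([0.3 + 0.5j, -1.0 + 0.2j, 2.0 - 1.0j, 6.0 + 0.0j])
g = g_mp(z, q)

# Any root of the quadratic satisfies Eq. (4.37); this checks the algebra only
residual = np.abs(1 / g - (z - 1 + q - q * z * g)).max()

# The branch is tested against the eigenvalues of one simulated matrix
rng = np.random.default_rng(2)            # arbitrary seed
N, T = 400, 800                           # q = N/T = 1/2
H = rng.standard_normal((N, T))
eigs = np.linalg.eigvalsh(H @ H.T / T)
z0 = 1.0 + 0.5j
g_emp = np.mean(1 / (z0 - eigs))          # (1/N) Tr (z0 1 - W)^{-1}
gap = abs(g_emp - g_mp(z0, q))
print(residual, gap)
```

A wrong choice of branch would leave the residual of Eq. (4.37) unchanged but would produce an order-one gap with the empirical value, which is why both checks are shown.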




Exercise 4.2.2 Finding the correct root 


(a) For the unit Wigner Stieltjes transform show that regardless of choice of sign
in Eq. (4.49) the point z = 2i is located on a branch cut and the function is
discontinuous at that point.

(b) Compute the value of Eqs. (4.50), (4.51) and (4.52) at z = 2i. Hint: for $g_2(z)$
write $-2 + 2i = \sqrt{8}\, e^{3i\pi/4}$ and similarly for 2 + 2i. The definition $g_3(z)$ is
ambiguous for z = 2i; compute the limiting value on both sides: $z = 0^+ + 2i$
and $z = 0^- + 2i$.


4.2.4 General (Non-White) Wishart Matrices 


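Recall our definition of a Wishart matrix from Section 4.1.2: a Wishart matrix is a matrix
$\mathbf{E}_{\mathbf{C}}$ defined as

$$\mathbf{E}_{\mathbf{C}} = \frac{1}{T}\, \mathbf{H}_{\mathbf{C}} \mathbf{H}_{\mathbf{C}}^\top, \tag{4.61}$$

where $\mathbf{H}_{\mathbf{C}}$ is an N × T rectangular matrix with independent columns. Each column is a random
Gaussian vector with covariance matrix C; $\mathbf{E}_{\mathbf{C}}$ corresponds to the sample (empirical)
covariance matrix of variables characterized by a population (true) covariance matrix C.

To understand the case where the true matrix C is different from the identity we first
discuss how to generate a multivariate Gaussian vector with covariance matrix C. We
diagonalize C as

$$\mathbf{C} = \mathbf{O} \mathbf{\Lambda} \mathbf{O}^\top, \qquad \mathbf{\Lambda} = \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_N^2 \end{pmatrix}. \tag{4.62}$$

The square-root of C can be defined as⁴

$$\mathbf{C}^{1/2} = \mathbf{O} \mathbf{\Lambda}^{1/2} \mathbf{O}^\top, \qquad \mathbf{\Lambda}^{1/2} = \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_N \end{pmatrix}. \tag{4.63}$$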


We now generate N IID unit Gaussian random variables $x_i$, 1 ≤ i ≤ N, which form a
random column vector x with entries $x_i$. Then we can generate the vector $\mathbf{y} = \mathbf{C}^{1/2}\mathbf{x}$. We
claim that y is a multivariate Gaussian vector with covariance matrix C. In fact, y is a linear
combination of multivariate Gaussians, so it must itself be multivariate Gaussian. On the
other hand, we have, using $\mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \mathbf{1}$,

$$\mathbb{E}(\mathbf{y}\mathbf{y}^\top) = \mathbb{E}\left(\mathbf{C}^{1/2}\mathbf{x}\mathbf{x}^\top\mathbf{C}^{1/2}\right) = \mathbf{C}. \tag{4.64}$$

By repeating the argument above for every column of $\mathbf{H}_{\mathbf{C}}$, t = 1, ..., T, we see that this
matrix can be written as $\mathbf{H}_{\mathbf{C}} = \mathbf{C}^{1/2}\mathbf{H}$, with H a rectangular matrix with IID unit Gaussian
entries. The matrix $\mathbf{E}_{\mathbf{C}}$ is then equivalent to

$$\mathbf{E}_{\mathbf{C}} = \frac{1}{T}\, \mathbf{H}_{\mathbf{C}} \mathbf{H}_{\mathbf{C}}^\top = \frac{1}{T}\, \mathbf{C}^{1/2} \mathbf{H}\mathbf{H}^\top \mathbf{C}^{1/2} = \mathbf{C}^{1/2}\, \mathbf{W}_q\, \mathbf{C}^{1/2}, \tag{4.65}$$

where $\mathbf{W}_q = \frac{1}{T}\mathbf{H}\mathbf{H}^\top$ is a white Wishart matrix with q = N/T.

4 This is the canonical definition of the square-root of a matrix, but this definition is not unique (see the technical paragraph
below).
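The construction above translates directly into a few lines of code. The following sketch is illustrative only: the 3 × 3 covariance matrix `C`, the sample size and the seed are hypothetical choices. It computes the symmetric square root of Eq. (4.63) by diagonalization, generates correlated Gaussian columns, and checks both the exact identity (4.65) and the fact that the sample covariance approaches C when T is much larger than N.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
N, T = 3, 20_000                           # small q = N/T, so E_C is close to C

# Hypothetical population covariance (symmetric, positive definite)
C = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 0.5]])

# Symmetric square root, Eq. (4.63): C^{1/2} = O Lambda^{1/2} O^T
lam, O = np.linalg.eigh(C)
C_half = O @ np.diag(np.sqrt(lam)) @ O.T

H = rng.standard_normal((N, T))            # IID unit Gaussian entries
H_C = C_half @ H                           # columns have covariance C
E_C = H_C @ H_C.T / T                      # sample covariance matrix, Eq. (4.61)
W = H @ H.T / T                            # white Wishart matrix

identity_error = np.abs(E_C - C_half @ W @ C_half).max()   # Eq. (4.65), exact
sampling_error = np.abs(E_C - C).max()                     # shrinks like T^{-1/2}
print(identity_error, sampling_error)
```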


We will see later that the above combination of matrices is called the free product of C 
and W. Free probability will allow us to compute the resolvent and the spectrum in the case 


of a general C matrix. 


The variables x defined above are called the “whitened” version of y. If a zero mean
random vector y has positive definite covariance matrix C, we can define a whitening of
y as a linear combination x = My such that $\mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \mathbf{1}$. One can show that the matrix
M satisfies $\mathbf{M}^\top\mathbf{M} = \mathbf{C}^{-1}$ and has to be of the form $\mathbf{M} = \mathbf{O}\mathbf{C}^{-1/2}$, where O can be
any orthogonal matrix and $\mathbf{C}^{1/2}$ is the symmetric square-root of C defined above. Since O is
arbitrary the procedure is not unique, which leads to three interesting choices for whitened
variables:


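• Perhaps the most natural one is the symmetric or Mahalanobis whitening where $\mathbf{M} = \mathbf{C}^{-1/2}$. In addition to being the only whitening scheme with a symmetric matrix M, the
white variables $\mathbf{x} = \mathbf{C}^{-1/2}\mathbf{y}$ are the closest to y in the following sense: the distance

$$\|\mathbf{x} - \mathbf{y}\|_{\mathbf{C}^\alpha} := \mathbb{E}\, \mathrm{Tr}\left[(\mathbf{x} - \mathbf{y})^\top \mathbf{C}^\alpha (\mathbf{x} - \mathbf{y})\right] \tag{4.66}$$

is minimal over all other choices of O for any α. The case α = −1 is called the
Mahalanobis norm.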

• Triangular or Gram-Schmidt whitening where the vector x can be constructed using the
Gram-Schmidt orthonormalization procedure. If one starts from the bottom with $x_N = y_N/\sqrt{C_{NN}}$, then the matrix M is upper triangular. The matrix M can be computed
efficiently using the Cholesky decomposition of $\mathbf{C}^{-1}$. The Cholesky decomposition of
a symmetric positive definite matrix A amounts to finding a lower triangular matrix L
such that $\mathbf{L}\mathbf{L}^\top = \mathbf{A}$. In the present case, $\mathbf{A} = \mathbf{C}^{-1}$ and the matrix M we are looking for
is given by

$$\mathbf{M} = \mathbf{L}^\top. \tag{4.67}$$

This scheme has the advantage that the whitened variable $x_k$ only depends on physical
variables $y_\ell$ for ℓ ≥ k. In finance, for example, this allows one to construct whitened
returns of a given stock using only the returns of itself and those of (say) more liquid
stocks.

• Eigenvalue (or PCA) whitening where O corresponds to the eigenbasis of C, i.e. such
that $\mathbf{C} = \mathbf{O}\mathbf{\Lambda}\mathbf{O}^\top$ where Λ is diagonal. M is then computed as $\mathbf{M} = \mathbf{\Lambda}^{-1/2}\mathbf{O}^\top$. The
whitened variables x are then called the normalized principal components of y.
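The three whitening schemes listed above can be compared on a small example. This is a sketch under stated assumptions: the covariance matrix `C` is a hypothetical 3 × 3 example, and the three matrices are built with `numpy.linalg.eigh` and `numpy.linalg.cholesky` (which returns a lower triangular factor). Each matrix M should satisfy $\mathbf{M}\mathbf{C}\mathbf{M}^\top = \mathbf{1}$, the Mahalanobis one being symmetric and the Cholesky-based one upper triangular.

```python
import numpy as np

# Hypothetical population covariance (symmetric, positive definite)
C = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 0.5]])

lam, O = np.linalg.eigh(C)                       # C = O diag(lam) O^T

# Symmetric (Mahalanobis) whitening: M = C^{-1/2}
M_sym = O @ np.diag(lam ** -0.5) @ O.T

# Triangular whitening, Eq. (4.67): L lower triangular with L L^T = C^{-1}
L = np.linalg.cholesky(np.linalg.inv(C))
M_tri = L.T                                      # upper triangular

# Eigenvalue (PCA) whitening: M = Lambda^{-1/2} O^T
M_pca = np.diag(lam ** -0.5) @ O.T

I = np.eye(3)
errors = {name: np.abs(M @ C @ M.T - I).max()
          for name, M in [("symmetric", M_sym), ("triangular", M_tri), ("pca", M_pca)]}
print(errors)
```

For a badly conditioned C one would normally avoid forming the explicit inverse; it is kept here only to mirror the text.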
5 Joint Distribution of Eigenvalues


In the previous chapters, we have studied the moments, the Stieltjes transform and the 
eigenvalue density of two classical ensembles (Wigner and Wishart). These quantities in 
fact relate to single eigenvalue properties of these ensembles. By this we mean that the 
Stieltjes transform and the eigenvalue density are completely determined by the univariate 
law of eigenvalues but they do not tell us anything about the correlations between different 
eigenvalues. 

In this chapter we will extend these results in two directions. First we will consider a 
larger class of rotationally invariant (or orthogonal) ensembles that contains Wigner and 
Wishart. Second we will study the joint law of all eigenvalues. In these models, the eigen- 
values turn out to be strongly correlated and can be thought of as “particles” interacting 
through pairwise repulsion. 


5.1 From Matrix Elements to Eigenvalues 
5.1.1 Matrix Potential 


Consider real symmetric random matrices M whose elements are distributed as the exponential
of the trace of a certain matrix function V(M), often called a potential by analogy
with statistical physics:¹

$$P(\mathbf{M}) = Z_N^{-1} \exp\left\{-\frac{N}{2}\, \mathrm{Tr}\, V(\mathbf{M})\right\}, \tag{5.1}$$

where $Z_N$ is a normalization constant. These matrix ensembles are called orthogonal
ensembles for they are rotationally invariant, i.e. invariant under orthogonal transformations.²
For the Wigner ensemble, for example, we have (see Chapter 2)

$$V(x) = \frac{x^2}{2\sigma^2}, \tag{5.3}$$

whereas the Wishart ensemble (at large N) is characterized by (see Chapter 4)

$$V(x) = \frac{x + (q - 1)\log x}{q} \tag{5.4}$$

(see Fig. 5.1). We can also consider other matrix potentials, e.g.

$$V(x) = \frac{x^2}{2} + \frac{\gamma x^4}{4}. \tag{5.5}$$

Figure 5.1 The Wishart matrix potential V(x) (Eq. (5.4)) for q = 1/2 and q = 2. The integration over
positive semi-definite matrices imposes that the eigenvalues must be greater than or equal to zero.
For q < 1 the potential naturally ensures that the eigenvalues are greater than zero and the constraint
will not be explicitly needed in the computation. For q > 1, the constraint is needed to obtain a
sensible result.

1 V(M) is a matrix function best defined in the eigenbasis of M through a transformation of all its eigenvalues through a
function of a scalar, V(x), see Section 1.2.6.
2 The results of this chapter extend to Hermitian (β = 2) or quaternion-Hermitian (β = 4) matrices with the simple
introduction of a factor β in the probability distribution:

$$P(\mathbf{M}) \propto \exp\left\{-\frac{\beta N}{2}\, \mathrm{Tr}\, V(\mathbf{M})\right\}; \tag{5.2}$$

this factor will match the factor of β from the Vandermonde determinant. These two other ensembles are called unitary
ensembles and symplectic ensembles, respectively. Collectively they are called the beta ensembles.

Note that Tr V(M) depends only on the eigenvalues of M. We would like thus to write
down the joint distribution of these eigenvalues alone. The key is to find the Jacobian of the
change of variables from the entries of M to the eigenvalues $\{\lambda_1, \ldots, \lambda_N\}$.


5.1.2 Matrix Jacobian


Before computing the Jacobian of the transformation from matrix elements to eigenvalues
and eigenvectors (or orthogonal matrices), let us count the number of variables in both
parameterizations. Suppose M can be diagonalized as

$$\mathbf{M} = \mathbf{O}\mathbf{\Lambda}\mathbf{O}^\top. \tag{5.6}$$

The symmetric matrix M has N(N + 1)/2 independent variables, and Λ has N independent
variables as a diagonal matrix. To find the number of independent variables in
O we first realize that $\mathbf{O}\mathbf{O}^\top = \mathbf{1}$ is an equation between two symmetric matrices and
thus imposes N(N + 1)/2 constraints out of $N^2$ potential values for the elements of O,
therefore O has N(N − 1)/2 independent variables. In total, we thus have N(N + 1)/2 =
N + N(N − 1)/2.

The change of variables from M to (Λ, O) will introduce a factor |det(Δ)|, where

$$\mathbf{\Delta} = \mathbf{\Delta}(\mathbf{M}) = \left[\frac{\partial \mathbf{M}}{\partial \mathbf{\Lambda}},\; \frac{\partial \mathbf{M}}{\partial \mathbf{O}}\right] \tag{5.7}$$

is the Jacobian matrix of dimension N(N + 1)/2 × N(N + 1)/2.

First, let us establish the scaling properties of the Jacobian. We assume that the matrix
elements of M have some dimension [M] = d (say centimeters). Using dimensional
analysis, we thus have

$$[\mathrm{D}\mathbf{M}] = d^{N(N+1)/2}, \qquad [\mathrm{D}\mathbf{\Lambda}] = d^{N}, \qquad [\mathrm{D}\mathbf{O}] = d^{0}, \tag{5.8}$$

since rotations are dimensionless. Hence we must have

$$[|\det(\mathbf{\Delta})|] \sim d^{N(N-1)/2}, \tag{5.9}$$

which has the dimension of an eigenvalue raised to the power N(N − 1)/2, the number of
distinct off-diagonal elements in M.

We now compute this Jacobian exactly. First, notice that the Jacobian relates the volume
“around” (Λ, O) when Λ and O change by infinitesimal amounts, to the volume “around”
M when its elements change by infinitesimal amounts. We note that volumes are invariant
under rotations, so in order to compute the infinitesimal volume we can choose the rotation
matrix O to be the identity matrix, which amounts to saying that we work in the basis where
M is diagonal. Another way to see this is to note that the orthogonal transformation

$$\mathbf{M} \to \mathbf{U}^\top \mathbf{M} \mathbf{U}, \qquad \mathbf{U}^\top\mathbf{U} = \mathbf{1} \tag{5.10}$$

has a Jacobian equal to 1, see Section 1.2.7. One can always choose U such that M is
diagonal.


5.1.3 Infinitesimal Rotations 


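For rotations O near the identity, we set

$$\mathbf{O} = \mathbf{1} + \epsilon\, \delta\mathbf{O}, \tag{5.11}$$

where ε is a small number and δO is some matrix. From the identity

$$\mathbf{1} = \mathbf{O}\mathbf{O}^\top = \mathbf{1} + \epsilon\left(\delta\mathbf{O} + \delta\mathbf{O}^\top\right) + \epsilon^2\, \delta\mathbf{O}\, \delta\mathbf{O}^\top, \tag{5.12}$$

we get $\delta\mathbf{O} = -\delta\mathbf{O}^\top$ by comparing terms of the first order in ε, i.e. δO is skew-symmetric.³
A convenient basis to write such infinitesimal rotations is

$$\epsilon\, \delta\mathbf{O} = \sum_{1 \le k < l \le N} \theta_{kl}\, \mathbf{A}^{(kl)}, \tag{5.13}$$

where $\mathbf{A}^{(kl)}$ are the elementary skew-symmetric matrices such that $\mathbf{A}^{(kl)}$ has only two non-zero
elements, $[\mathbf{A}^{(kl)}]_{kl} = 1$ and $[\mathbf{A}^{(kl)}]_{lk} = -1$:

$$\mathbf{A}^{(kl)} = \begin{pmatrix} 0 & \cdots & & \cdots & 0 \\ \vdots & & & 1 & \vdots \\ & & \ddots & & \\ \vdots & -1 & & & \vdots \\ 0 & \cdots & & \cdots & 0 \end{pmatrix}. \tag{5.14}$$

An infinitesimal rotation is therefore fully described by N(N − 1)/2 generalized “angles”
$\theta_{kl}$.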

5.1.4 Vandermonde Determinant 


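Now, in the neighborhood of (Λ, O = 1), the matrix M + δM can be parameterized as

$$\mathbf{M} + \delta\mathbf{M} \approx \left[\mathbf{1} + \sum_{k,l} \theta_{kl}\, \mathbf{A}^{(kl)}\right] (\mathbf{\Lambda} + \delta\mathbf{\Lambda}) \left[\mathbf{1} - \sum_{k,l} \theta_{kl}\, \mathbf{A}^{(kl)}\right]. \tag{5.15}$$

So to first order in δΛ and $\theta_{kl}$,

$$\delta\mathbf{M} \approx \delta\mathbf{\Lambda} + \sum_{k,l} \theta_{kl} \left[\mathbf{A}^{(kl)}\mathbf{\Lambda} - \mathbf{\Lambda}\mathbf{A}^{(kl)}\right]. \tag{5.16}$$

Using this local parameterization, we can compute the Jacobian matrix and find its determinant.
For the diagonal contribution, we have

$$\frac{\partial M_{ij}}{\partial \Lambda_{nn}} = \delta_{in}\, \delta_{jn}, \tag{5.17}$$

i.e. perturbing a given eigenvalue only changes the corresponding diagonal element with
slope 1.
For the rotation contribution, one has, for k < l and i < j,

$$\frac{\partial M_{ij}}{\partial \theta_{kl}} = \left[\mathbf{A}^{(kl)}\mathbf{\Lambda} - \mathbf{\Lambda}\mathbf{A}^{(kl)}\right]_{ij} = \begin{cases} \lambda_l - \lambda_k & \text{if } i = k,\ j = l, \\ 0 & \text{otherwise,} \end{cases} \tag{5.18}$$

3 The reader familiar with the analysis of compact Lie groups will recognize the statement that skew-symmetric matrices form
the Lie algebra of O(N).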

i.e. an infinitesimal rotation in the direction kl modifies only one distinct off-diagonal
element ($M_{kl} = M_{lk}$) with slope $\lambda_l - \lambda_k$. In particular, if two eigenvalues are the same
($\lambda_k = \lambda_l$) a rotation of the eigenvectors in that subspace has no effect on the matrix M. This
is expected since eigenvectors in a degenerate subspace are only defined up to a rotation
within that subspace.

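Finally, the N(N + 1)/2 × N(N + 1)/2 determinant has its first N diagonal elements
equal to unity, and the next N(N − 1)/2 are equal to all possible pair differences $\lambda_i - \lambda_j$.
Hence,

$$\Delta(\mathbf{M}) = \det \begin{pmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & 1 & & & & \\ & & & \lambda_2 - \lambda_1 & & & \\ & & & & \lambda_3 - \lambda_1 & & \\ & & & & & \ddots & \\ & & & & & & \lambda_N - \lambda_{N-1} \end{pmatrix} = \prod_{k<l} (\lambda_l - \lambda_k). \tag{5.19}$$

The absolute value of Δ is then given by

$$|\Delta(\mathbf{M})| = \prod_{k<l} |\lambda_l - \lambda_k|. \tag{5.20}$$

We can check that this result has the expected dimension $d^{N(N-1)/2}$, since the product
contains exactly N(N − 1)/2 terms. The determinant Δ(M) is called the Vandermonde
determinant as it is equal to the determinant of the following N × N Vandermonde matrix:

$$\begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ \lambda_1 & \lambda_2 & \lambda_3 & \cdots & \lambda_N \\ \lambda_1^2 & \lambda_2^2 & \lambda_3^2 & \cdots & \lambda_N^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \lambda_1^{N-1} & \lambda_2^{N-1} & \lambda_3^{N-1} & \cdots & \lambda_N^{N-1} \end{pmatrix}. \tag{5.21}$$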

Since the above Jacobian has no dependence on the matrix O, we can integrate out the
rotation part of (5.1) to get the joint distribution of eigenvalues:

$$P(\{\lambda_i\}) \propto \prod_{k<l} |\lambda_k - \lambda_l|\; \exp\left[-\frac{N}{2} \sum_{i=1}^{N} V(\lambda_i)\right]. \tag{5.22}$$

A key feature of the above probability density is that the eigenvalues are not independent,
since the term $\prod_{k<l} |\lambda_k - \lambda_l|$ indicates that the probability density vanishes when
two eigenvalues tend towards one another. This can be interpreted as some effective
“repulsion” between eigenvalues, as we will expand on now using an analogy with
Coulomb gases.
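The repulsion is already visible in the simplest case N = 2. The sketch below is an illustration with arbitrary sample size and seed: it draws many 2 × 2 real symmetric Gaussian matrices, normalizes the eigenvalue spacing to unit mean, and measures how often the spacing is very small. The reference value uses the law $P(x) = \frac{\pi}{2} x\, e^{-\pi x^2/4}$ quoted in Exercise 5.1.2 (Eq. (5.24)), whose cumulative distribution is $1 - e^{-\pi x^2/4}$; for comparison, independent levels with unit mean spacing would follow an exponential law.

```python
import numpy as np

rng = np.random.default_rng(5)                       # arbitrary seed
n = 200_000                                          # number of 2 x 2 samples

G = rng.standard_normal((n, 2, 2))
M = (G + np.transpose(G, (0, 2, 1))) / np.sqrt(2)    # real symmetric samples
eigs = np.linalg.eigvalsh(M)                         # shape (n, 2), ascending
s = eigs[:, 1] - eigs[:, 0]
s = s / s.mean()                                     # unit mean spacing

x_small = 0.1
observed = np.mean(s < x_small)                      # fraction of tiny spacings
repelled = 1 - np.exp(-np.pi * x_small**2 / 4)       # from Eq. (5.24)
independent = 1 - np.exp(-x_small)                   # exponential (no repulsion)
print(observed, repelled, independent)
```

Small spacings turn out to be about ten times rarer than for independent levels, which is the linear vanishing of the density implied by the factor $|\lambda_k - \lambda_l|$.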


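Exercise 5.1.1 Vandermonde determinant for 2 × 2 matrices
In this exercise we will explicitly compute the Vandermonde determinant for
2 × 2 matrices. We define O and Λ as

$$\mathbf{O} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \quad\text{and}\quad \mathbf{\Lambda} = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}. \tag{5.23}$$

Then any 2 × 2 symmetric matrix can be written as $\mathbf{M} = \mathbf{O}\mathbf{\Lambda}\mathbf{O}^\top$.

(a) Write explicitly $M_{11}$, $M_{12}$ and $M_{22}$ as a function of $\lambda_1$, $\lambda_2$ and θ.

(b) Compute the 3 × 3 matrix Δ of partial derivatives of $M_{11}$, $M_{12}$ and $M_{22}$ with
respect to $\lambda_1$, $\lambda_2$ and θ.

(c) In the special cases where θ equals 0, π/4 and π/2 show that $|\det \mathbf{\Delta}| = |\lambda_1 - \lambda_2|$. If you have the courage show that $|\det \mathbf{\Delta}| = |\lambda_1 - \lambda_2|$ for all θ.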
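The result of this exercise can be checked numerically before attempting the algebra. The following sketch is not a solution: it approximates the 3 × 3 Jacobian by central finite differences, with hypothetical helper names `m_entries` and `jacobian_det`, an arbitrary step size, and arbitrary test values of the eigenvalues. It uses the same parameterization $\mathbf{M} = \mathbf{O}\mathbf{\Lambda}\mathbf{O}^\top$ as Eq. (5.23).

```python
import numpy as np

def m_entries(p):
    """Return (M11, M12, M22) of M = O Lambda O^T for p = (lam1, lam2, theta)."""
    lam1, lam2, theta = p
    c, s = np.cos(theta), np.sin(theta)
    O = np.array([[c, s], [-s, c]])
    M = O @ np.diag([lam1, lam2]) @ O.T
    return np.array([M[0, 0], M[0, 1], M[1, 1]])

def jacobian_det(p, h=1e-6):
    """Determinant of d(M11, M12, M22)/d(lam1, lam2, theta), central differences."""
    p = np.asarray(p, dtype=float)
    J = np.empty((3, 3))
    for k in range(3):
        dp = np.zeros(3)
        dp[k] = h
        J[:, k] = (m_entries(p + dp) - m_entries(p - dp)) / (2 * h)
    return np.linalg.det(J)

lam1, lam2 = 1.3, -0.4                      # arbitrary test eigenvalues
thetas = [0.0, np.pi / 4, np.pi / 2, 0.7, 2.1]
dets = [abs(jacobian_det((lam1, lam2, th))) for th in thetas]
print(dets, abs(lam1 - lam2))
```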


Exercise 5.1.2 Wigner surmise

Wigner was interested in the distribution of energy level spacings in heavy
nuclei, which he modeled as the eigenvalues of a real symmetric random
matrix (time reversal symmetry imposes that the Hamiltonian be real). Let
$x = |\lambda_{k+1} - \lambda_k|$ for k in the bulk. In principle we can obtain the probability
density of x by using Eq. (5.22) and integrating out all other variables. In practice
it is very difficult to go much beyond N = 2. Since the N = 2 result (properly
normalized) has the correct small x and large x behavior, Wigner surmised that
it must be a good approximation at any N.


(a) For an N = 2 GOE matrix, i.e. $V(\lambda) = \lambda^2/(2\sigma^2)$, write the unnormalized law
of $\lambda_1$ and $\lambda_2$, its two eigenvalues.

(b) Change variables to $\lambda_\pm = \lambda_2 \pm \lambda_1$, integrate out $\lambda_+$ and write the unnormalized
law of $x = |\lambda_-|$.

(c) Normalize your law and choose σ such that $\mathbb{E}[x] = 1$; you should find

$$P(x) = \frac{\pi}{2}\, x \exp\left[-\frac{\pi}{4} x^2\right]. \tag{5.24}$$

(d) Using Eq. (5.26) redo the computation for GUE (β = 2). You should find

$$P(x) = \frac{32}{\pi^2}\, x^2 \exp\left[-\frac{4}{\pi} x^2\right]. \tag{5.25}$$


5.2 Coulomb Gas and Maximum Likelihood Configurations
5.2.1 A Coulomb Gas Analogy 


The orthogonal ensemble defined in the previous section can be generalized to complex
or quaternion Hermitian matrices. The corresponding joint distribution of eigenvalues is
simply obtained by adding a factor of β (equal to 1, 2 or 4) to both the potential and the
Vandermonde determinant:

$$\begin{aligned} P(\{\lambda_i\}) &= Z_N^{-1} \prod_{k<l} |\lambda_k - \lambda_l|^\beta\; \exp\left[-\frac{\beta}{2} \sum_{i=1}^{N} N V(\lambda_i)\right] \\ &= Z_N^{-1} \exp\left\{-\frac{\beta}{2} \left[\sum_{i=1}^{N} N V(\lambda_i) - \sum_{\substack{i,j=1 \\ j \ne i}}^{N} \log|\lambda_i - \lambda_j|\right]\right\}. \end{aligned} \tag{5.26}$$

This joint law is exactly the Gibbs-Boltzmann factor ($e^{-E/T}$) for a gas of N particles
moving on a one-dimensional line, at temperature T = 2/β, whose potential energy is
given by NV(x) and that interact with each other via a pairwise repulsive force generated
by the potential $V_R(x, y) = -\log(|x - y|)$. Formally, the repulsive term happens to be the
Coulomb potential in two dimensions for particles that all have the same sign. In a truly
one-dimensional problem, the Coulomb potential would read $V_{1d}(x, y) = -|x - y|$, but
with a slight abuse of language one speaks about the eigenvalues of a random matrix as a
Coulomb gas (in one dimension).

Even though we are interested in one particular value of β (namely β = 1), we can build
an intuition by considering this system at various temperatures. At very low temperature
(i.e. β → ∞), the N particles all want to minimize their potential energy and sit at the
minimum of NV(x), but if they try to do so they will have to pay a high price in interaction
energy as this energy increases as the particles get close to one another. The particles will
have to spread themselves around the minimum of NV(x) to minimize the sum of the
potential and interaction energy and find the configuration corresponding to “mechanical
equilibrium”, i.e. such that the total force on each particle is zero. At non-zero temperature
(finite β) the particles will fluctuate around this equilibrium solution. Since the repulsion
energy diverges as any two eigenvalues get infinitely close, the particles will always avoid
each other. Figure 5.2 shows a typical configuration of particles/eigenvalues for N = 20 at
β = 1 in a quadratic potential (GOE matrix).

In the next section, we will study this equilibrium solution, which is exact at low temperature
β → ∞ or when N → ∞, and is the maximum likelihood solution at finite β and
finite N.


5.2.2 Maximum Likelihood Configuration and Stieltjes Transform 


In the previous section, we saw that in the Coulomb gas analogy, β → ∞ corresponds to
the zero temperature limit, and that in this limit the eigenvalues freeze to the minimum of
the energy (potential plus interaction). We will argue that this freezing (or concentration of
the equilibrium measure) also happens when N → ∞ for fixed β.

Figure 5.2 Representation of a typical N = 20 GOE matrix as a Coulomb gas. The full curve
represents the potential $V(x) = x^2/2$ and the 20 dots, the positions of the eigenvalues of a typical
configuration. In this analogy, the eigenvalues feel a potential NV(x) and a repulsive pairwise
interaction $V(x, y) = -\log(|x - y|)$. They fluctuate according to the Boltzmann weight $e^{-\beta E/2}$
with β = 1 in the present case.

Let us study the minimum energy configuration. We can rewrite Eq. (5.26) as

$$P(\{\lambda_i\}) \propto e^{\frac{1}{2}\beta N^2 \mathcal{L}(\{\lambda_i\})}, \qquad \mathcal{L}(\{\lambda_i\}) = -\frac{1}{N} \sum_{i=1}^{N} V(\lambda_i) + \frac{1}{N^2} \sum_{\substack{i,j=1 \\ j \ne i}}^{N} \log|\lambda_i - \lambda_j|, \tag{5.27}$$

where $\mathcal{L}$ stands for “log-likelihood”. For finite N and finite β, we can still consider the
solution that maximizes $\mathcal{L}(\{\lambda_i\})$. This is the maximum likelihood solution, i.e. the configuration
of $\{\lambda_i\}$ that has maximum probability. The maximum of $\mathcal{L}$ is determined by the
equations

$$\frac{\partial \mathcal{L}}{\partial \lambda_i} = 0 \quad\Rightarrow\quad V'(\lambda_i) = \frac{2}{N} \sum_{\substack{j=1 \\ j \ne i}}^{N} \frac{1}{\lambda_i - \lambda_j}. \tag{5.28}$$

These are N coupled equations of N variables which can get very tedious to solve even
for moderate values of N. In Exercise 5.2.1 we will find the solution for N = 3 in
the Wigner case. The solution of these equations is the set of equilibrium positions of
all the eigenvalues, i.e. the set of eigenvalues that maximizes the joint probability. To
characterize this solution (which will allow us to obtain the density of eigenvalues), we
will compute the Stieltjes transform of the $\{\lambda_i\}$ satisfying Eq. (5.28). The trick is to make
algebraic manipulations to both sides of the equation to make the Stieltjes transform
explicitly appear.

In a first step we multiply both sides of Eq. (5.28) by $1/(z - \lambda_i)$ where z is a complex
variable not equal to any eigenvalue, and then we sum over the index i and divide by N. This gives

$$\begin{aligned} \frac{1}{N} \sum_{i=1}^{N} \frac{V'(\lambda_i)}{z - \lambda_i} &= \frac{2}{N^2} \sum_{\substack{i,j=1 \\ j \ne i}}^{N} \frac{1}{(\lambda_i - \lambda_j)(z - \lambda_i)} \\ &= \frac{1}{N^2} \sum_{\substack{i,j=1 \\ j \ne i}}^{N} \left[\frac{1}{(\lambda_i - \lambda_j)(z - \lambda_i)} + \frac{1}{(\lambda_j - \lambda_i)(z - \lambda_j)}\right] \\ &= \frac{1}{N^2} \sum_{\substack{i,j=1 \\ j \ne i}}^{N} \frac{1}{(z - \lambda_i)(z - \lambda_j)} \\ &= g_N^2(z) + \frac{g_N'(z)}{N}, \end{aligned} \tag{5.29}$$

where $g_N(z)$ is the Stieltjes transform at finite N:

$$g_N(z) := \frac{1}{N} \sum_{i=1}^{N} \frac{1}{z - \lambda_i}. \tag{5.30}$$
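We still need to handle the left hand side of the above equation. First we add and subtract
$V'(z)$ in the numerator, yielding

$$\frac{1}{N}\sum_{i=1}^{N} \frac{V'(\lambda_i)}{z - \lambda_i} = V'(z)\,\frac{1}{N}\sum_{i=1}^{N}\frac{1}{z - \lambda_i} - \frac{1}{N}\sum_{i=1}^{N} \frac{V'(z) - V'(\lambda_i)}{z - \lambda_i} = V'(z) g_N(z) - \Pi_N(z), \tag{5.31}$$

where we have defined a new function $\Pi_N(z)$ as

$$\Pi_N(z) := \frac{1}{N}\sum_{i=1}^{N} \frac{V'(z) - V'(\lambda_i)}{z - \lambda_i}. \tag{5.32}$$
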
This does not look very useful as the equation for $g_N(z)$ will depend on some unknown
function $\Pi_N(z)$ that depends on the eigenvalues whose statistics we are trying to determine.
The key realization is that if $V'(z)$ is a polynomial of degree $k$ then $\Pi_N(z)$ is also a
polynomial and it has degree $k - 1$. Indeed, for each $i$ in the sum, $V'(z) - V'(\lambda_i)$ is a
degree $k$ polynomial having $z = \lambda_i$ as a zero, so $(V'(z) - V'(\lambda_i))/(z - \lambda_i)$ is a polynomial
of degree $k - 1$. $\Pi_N(z)$ is the sum of such polynomials so is itself a polynomial of degree
$k - 1$.
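As a concrete illustration of this degree counting (a worked example only, using the quartic potential of Section 5.3.3 and the shorthand $m_k := \frac{1}{N}\sum_i \lambda_i^k$ for the empirical moments, a notation introduced just for this example), take $V'(z) = z + \gamma z^3$, so that $k = 3$. Then

$$\frac{V'(z) - V'(\lambda_i)}{z - \lambda_i} = 1 + \gamma\left(z^2 + z\lambda_i + \lambda_i^2\right) \quad\Rightarrow\quad \Pi_N(z) = 1 + \gamma\left(z^2 + m_1 z + m_2\right),$$

a polynomial of degree $k - 1 = 2$ whose coefficients depend on the eigenvalues only through their first two moments.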
In fact, the argument is easy to generalize to Laurent polynomials, i.e. functions such that
$z^k V'(z)$ is a polynomial for some $k \in \mathbb{N}$. For example, in the Wishart case we have a
Laurent polynomial

$$V'(z) = \frac{1}{q}\left(1 + \frac{q-1}{z}\right). \tag{5.33}$$

Nevertheless, from now on we make the assumption that $V'(z)$ is a polynomial. We will
later discuss how to relax this assumption.
Thus we get from Eqs. (5.29) and (5.31) that

$$V'(z) g_N(z) - \Pi_N(z) = g_N^2(z) + \frac{g_N'(z)}{N} \tag{5.34}$$

for some polynomial $\Pi_N$ of degree $\deg(V'(z)) - 1$, which needs to be determined
self-consistently using Eq. (5.32). For a given $V'(z)$, the coefficients of $\Pi_N$ are related to the
moments of the $\{\lambda_i\}$, which themselves can be obtained from expanding $g_N(z)$ around
infinity. In some cases, Eq. (5.34) can be solved exactly at finite $N$, for example in the case
where $V(z) = z^2/2$, in which case the solution can be expressed in terms of Hermite
polynomials (see Chapter 6). In the present chapter we will study this equation in the large $N$
limit, which will allow us to derive a general formula for the limiting density of eigenvalues.
Note that since Eq. (5.34) does not depend on the value of $\beta$, the corresponding eigenvalue
density will also be independent of $\beta$.
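Equation (5.34) is an exact identity at finite $N$, which makes it easy to test numerically. The sketch below is a sanity check under stated assumptions: it takes $V(x) = x^2/2$ (so that $V'(z) = z$ and $\Pi_N = 1$) and uses as stationary configuration for $N = 4$ the zeros of $H_4(x) = x^4 - 6x^2 + 3$ divided by $\sqrt{N}$, as stated in Chapter 6. The helper names are illustrative.

```python
import math

N = 4
# Stationary configuration for V(x) = x^2/2 and N = 4: zeros of H_4(x) = x^4 - 6x^2 + 3,
# i.e. +/- sqrt(3 +/- sqrt(6)), divided by sqrt(N) (see Section 6.1.1).
roots = [s * math.sqrt(3 + t * math.sqrt(6)) / math.sqrt(N) for s in (-1, 1) for t in (-1, 1)]

def g(z):
    """Finite-N Stieltjes transform, Eq. (5.30)."""
    return sum(1 / (z - lam) for lam in roots) / N

def g_prime(z):
    """Derivative of g with respect to z."""
    return -sum(1 / (z - lam) ** 2 for lam in roots) / N

def finite_n_mismatch(z):
    """Absolute difference between the two sides of Eq. (5.34)."""
    lhs = z * g(z) - 1.0                 # V'(z) g_N(z) - Pi_N(z) with V'(z) = z, Pi_N = 1
    rhs = g(z) ** 2 + g_prime(z) / N
    return abs(lhs - rhs)

print(finite_n_mismatch(0.7 + 0.3j))     # expected to be at the level of rounding errors
```

The $g_N'(z)/N$ term cannot be dropped at this size: without it the two sides differ at order $1/N$.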


Exercise 5.2.1 Maximum likelihood for 3 × 3 Wigner matrices
In this exercise we will write explicitly the three eigenvalues of the maximum
likelihood configuration of a 3 × 3 GOE matrix. The potential for this ensemble
is $V(x) = x^2/2$.
(a) Let $\lambda_1, \lambda_2, \lambda_3$ be the three maximum likelihood eigenvalues of the 3 × 3 GOE
ensemble in decreasing order. By symmetry we expect $\lambda_3 = -\lambda_1$. What do
you expect for $\lambda_2$?
(b) Consider Eq. (5.28). Assuming $\lambda_3 = -\lambda_1$, check that your guess for $\lambda_2$ is
indeed a solution. Now write the equation for $\lambda_1$ and solve it.
(c) Using your solution and the definition (5.30), show that the Stieltjes transform
of the maximum likelihood configuration is given by

$$g_3(z) = \frac{z^2 - \frac{1}{3}}{z^3 - z}. \tag{5.35}$$

(d) In the simple case $V(x) = x^2/2$, the zero-degree polynomial $\Pi_N(z)$ is just a
constant (independent of $N$) that can be evaluated from the definition (5.32).
What is this constant?

(e) Verify that your $g_3(z)$ satisfies Eq. (5.34) with $N = 3$.


5.2.3 The Large N Limit 


In the large $N$ limit, $g_N(z)$ is self-averaging so computing $g_N(z)$ for the most likely
configuration is the same as computing the average $g(z)$. As $N \to \infty$, Eq. (5.34) becomes

$$V'(z)g(z) - \Pi(z) = g^2(z). \tag{5.36}$$

Each value of $N$ gives a different degree-$(k - 1)$ polynomial $\Pi_N(z)$. From the definition
(5.32), we can show that the coefficients of $\Pi_N(z)$ are related to the moments of the
maximum likelihood configuration of size $N$. In the large $N$ limit these moments converge
so the sequence $\Pi_N(z)$ converges to a well-defined polynomial of degree $(k - 1)$ which we
call $\Pi(z)$.

Since Eq. (5.36) is quadratic in $g(z)$, its solution is given by

$$g(z) = \frac{1}{2}\left[V'(z) \pm \sqrt{V'(z)^2 - 4\Pi(z)}\right], \tag{5.37}$$

where we have to choose the branch where $g(z)$ goes to zero for large $|z|$.

The eigenvalues of $M$ will be located where $g(z)$ has an imaginary part for $z$ very close
to the real axis. The first term $V'(z)$ is a real polynomial and is always real for real $z$. The
expression $V'(z)^2 - 4\Pi(z)$ is also a real polynomial so $g(z)$ cannot be complex on the real
axis unless $V'(z)^2 - 4\Pi(z) < 0$. In this case $\sqrt{V'(z)^2 - 4\Pi(z)}$ is purely imaginary. We
conclude that, when $x$ is such that $\rho(x) \neq 0$,

$$\mathrm{Re}(g(x)) = \mathrm{P}\!\!\int \frac{\rho(\lambda)\,d\lambda}{x - \lambda} = \frac{V'(x)}{2}, \tag{5.38}$$

where $\mathrm{P}\!\int$ denotes the principal part of the integral. $\mathrm{Re}(g(x))/\pi$ is also called the Hilbert
transform of $\rho(\lambda)$.⁴

We have thus shown that the Hilbert transform of the density of eigenvalues is (within its
support) equal to $1/(2\pi)$ times the derivative of the potential. We thus realize that the potential
outside the support of the eigenvalues has no effect on the distribution of eigenvalues. This
is natural in the Coulomb gas analogy. At equilibrium, the particles do not feel the potential
away from where they are. One consequence is that we can consider potentials that are not
confining at infinity as long as there is a confinement region and that all eigenvalues are
within that region. For example, we will consider the quartic potential $V(x) = x^2/2 + \gamma x^4/4$.
For small negative $\gamma$ the region around $x = 0$ is convex. If all eigenvalues are
found to be contained in that region, we can modify at will the potential away from it so
that $V(x) \to +\infty$ for $|x| \to \infty$ and keep Eq. (5.1) normalizable, as a probability density
should be.

Suppose now that we have a potential that is not a polynomial. In a finite region we can 
approximate it arbitrarily well by a polynomial of sufficiently high degree. If we choose the 
region of approximation such that for every successive approximation all eigenvalues lie in 
that region, we can take the limits of these approximations and find that Eq. (5.38) holds 
even if V’(x) is not a polynomial. 

We can also ask the reverse question. Given a density $\rho(x)$, does there exist a model from
the orthogonal ensemble (or other $\beta$-ensemble) that has $\rho(\lambda)$ as its eigenvalue density? If
the Hilbert transform of $\rho(x)$ is well defined, then the answer is yes and Eq. (5.38) gives the
corresponding potential. Note that the potential is only defined up to an additive constant
(it can be absorbed in the normalization of Eq. (5.1)) so knowing its derivative is enough
to compute $V(x)$. Note also that we only know the value of $V(x)$ on the support of $\rho(x)$;
outside this support we can arbitrarily choose $V(x)$ provided it is convex and goes to infinity
as $|x| \to \infty$.

⁴ This is a slight abuse of terms, however, that can lead to paradoxes, if one extends Eq. (5.38) to the region where $\rho(x) = 0$. In
this case, the right hand side of the equation is not equal to $V'(x)/2$. See also Appendix A.2.


Exercise 5.2.2 Matrix potential for the uniform density
In Exercise 2.3.3, we saw that the Stieltjes transform for a uniform density of
eigenvalues between 0 and 1 is given by

$$g(z) = \log\left(\frac{z}{z - 1}\right). \tag{5.39}$$

(a) By computing $\mathrm{Re}(g(x))$ for $x$ between 0 and 1, find $V'(x)$ using Eq. (5.38).

(b) Compute the Hilbert transform of the uniform density to recover your answer
in (a).

(c) From your answer in (a) and (b), show that the matrix potential is given by

$$V(x) = 2\left[(1 - x)\log(1 - x) + x\log(x)\right] + C \quad \text{for } 0 < x < 1, \tag{5.40}$$

where $C$ is an arbitrary constant. Note that for $x < 0$ and $x > 1$ the potential
should be completed by a convex function that goes to infinity as $|x| \to \infty$.


5.3 Applications: Wigner, Wishart and the One-Cut Assumption 
5.3.1 Back to Wigner and Wishart 


Now we apply the results of the previous section to the Gaussian orthogonal case where
$V(z) = z^2/2\sigma^2$. In this simple case, $\Pi(z)$ can be computed from its definition without
knowing the eigenvalues, since

$$V'(z) = \frac{z}{\sigma^2} \;\Rightarrow\; \Pi(z) = \frac{1}{\sigma^2}. \tag{5.41}$$

Then (5.37) gives

$$g(z) = \frac{z - \sqrt[\oplus]{z^2 - 4\sigma^2}}{2\sigma^2}, \tag{5.42}$$

which recovers, independently of the value of $\beta$, the Wigner result Eq. (2.38), albeit within
a completely different framework (the notation $\sqrt[\oplus]{\cdot}$ was introduced in Section 4.2.3). In
particular, the cavity method does not assume that the matrix ensemble is rotationally
invariant, as we do here.
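When Eq. (5.42) is evaluated numerically, the choice of branch matters. The sketch below assumes that the special square root is the branch behaving like $z$ at large $|z|$, implemented here as the product $\sqrt{z - 2\sigma}\,\sqrt{z + 2\sigma}$ of two principal roots; the function names are illustrative. It contrasts this with the principal root of $z^2 - 4\sigma^2$, which solves the same quadratic equation but selects the wrong solution when $\mathrm{Re}\,z < 0$.

```python
import cmath

def g_wigner(z, sigma=1.0):
    """Eq. (5.42) with the root taken as sqrt(z - 2s) * sqrt(z + 2s), which behaves like z at infinity."""
    root = cmath.sqrt(z - 2 * sigma) * cmath.sqrt(z + 2 * sigma)
    return (z - root) / (2 * sigma ** 2)

def g_naive(z, sigma=1.0):
    """Same formula with the principal root of z^2 - 4 s^2: wrong branch when Re z < 0."""
    return (z - cmath.sqrt(z * z - 4 * sigma ** 2)) / (2 * sigma ** 2)

z = -3 + 0.1j
for value in (g_wigner(z), g_naive(z)):
    # For sigma = 1 both candidates solve the quadratic equation (5.36): g^2 - z g + 1 = 0,
    # but only one of them decays like 1/z.
    print(value, abs(value * value - z * value + 1))
```

The product form is a common way to encode the requirement $g(z) \to 1/z$ without treating the two half-planes separately.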
In the Wishart case, we only consider the case $q < 1$, otherwise ($q > 1$) the potential is
not confining and we need to impose the positive semi-definiteness of the matrix to avoid
eigenvalues running to minus infinity. We have

$$V'(z) = \frac{1}{q}\left(1 + \frac{q-1}{z}\right). \tag{5.43}$$

In this case $zV'(z)$ is of degree one, so $z\Pi(z)$ is a polynomial of degree zero:

$$\Pi(z) = \frac{c}{z} \tag{5.44}$$

for some constant $c$. Thus (5.37) gives

$$g(z) = \frac{z + q - 1 - \sqrt[\oplus]{(z + q - 1)^2 - 4cq^2 z}}{2qz}. \tag{5.45}$$

As $z \to \infty$, this expression becomes

$$g(z) = \frac{cq}{z} + O(1/z^2). \tag{5.46}$$

Imposing $g(z) \sim z^{-1}$ gives $c = q^{-1}$. After some manipulations, we recover Eq. (4.40).
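The manipulations are short enough to sketch. With $c = 1/q$, the argument of the square root in Eq. (5.45) factorizes as

$$(z + q - 1)^2 - 4qz = z^2 - 2(1 + q)z + (1 - q)^2 = (z - \lambda_-)(z - \lambda_+), \qquad \lambda_\pm = (1 \pm \sqrt{q})^2,$$

so that the branch cut of $g(z)$ runs between the two Marchenko-Pastur edges $\lambda_\pm$.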


5.3.2 General Convex Potentials and the One-Cut Assumption 


For more general polynomial potentials, finding an explicit solution for the limiting Stieltjes
transform is a little bit more involved. We recall Eq. (5.37):

$$g(z) = \frac{1}{2}\left[V'(z) \pm \sqrt{V'(z)^2 - 4\Pi(z)}\right]. \tag{5.47}$$

For a particular polynomial $V'(z)$, $\Pi(z)$ is a polynomial that depends on the moments of the
matrix $M$. The expansion of $g(z)$ near $z \to \infty$ will give a set of self-consistent equations
for the coefficients of $\Pi(z)$.

The problem simplifies greatly if the support of the density of eigenvalues is a single compact
interval, i.e. if the density $\rho(\lambda)$ is non-zero for all $\lambda$'s between two edges $\lambda_-$ and $\lambda_+$. We expect
this to be true if the potential $V(x)$ is convex. Indeed, by the Coulomb gas analogy we could place
all eigenvalues near the minimum of $V(x)$ and let them find their equilibrium configuration
by repelling each other. For a convex potential it is natural to assume that the equilibrium
configuration would not have any gaps. This assumption is equivalent to assuming that
the limiting Stieltjes transform has a single branch cut (from $\lambda_-$ to $\lambda_+$), hence the name
one-cut assumption.

So, for a convex polynomial potential $V(x)$, we expect that there exists a well-defined
equilibrium density $\rho(\lambda)$ that is non-zero if and only if $\lambda_- < \lambda < \lambda_+$ and that $g(z)$ satisfies

$$g(z) = \int_{\lambda_-}^{\lambda_+} \frac{\rho(\lambda)}{z - \lambda}\,d\lambda. \tag{5.48}$$

From this equation we notice three important properties of $g(z)$:

• The function $g(z)$ is potentially singular at $\lambda_-$ and $\lambda_+$.

• Near the real axis ($\mathrm{Im}\,z = 0^+$) $g(z)$ has an imaginary part if $z \in (\lambda_-, \lambda_+)$ and is real
otherwise.

• The function $g(z)$ is analytic everywhere else.


If we go back to Eq. (5.37), we notice that any non-analytic behavior must come from
the square-root. On the real axis, the only way $g(z)$ can have an imaginary part is if
$D(z) := V'(z)^2 - 4\Pi(z) < 0$ for some values of $z$. So $D(z)$ (a polynomial of degree $2k$) must change
sign at some values $\lambda_-$ and $\lambda_+$, hence these must be zeros of the polynomial. On the real
axis, the other possible zeros of $D(z)$ can only be of even multiplicity (otherwise $D(z)$ would
change sign). Elsewhere in the complex plane, zeros should also be of even multiplicity,
otherwise $\sqrt{D(z)}$ would be singular at those zeros. In other words $D(z)$ must be of the
form

$$D(z) = (z - \lambda_-)(z - \lambda_+)\,Q^2(z), \tag{5.49}$$

for some polynomial $Q(z)$ of degree $k - 1$ where $k$ is the degree of $V'(z)$. We can therefore
write $g(z)$ as

$$g(z) = \frac{V'(z) - Q(z)\sqrt[\oplus]{(z - \lambda_-)(z - \lambda_+)}}{2}, \tag{5.50}$$

where again $Q(z)$ is a polynomial with real coefficients of degree one less than $V'(z)$. The
condition that

$$g(z) \to \frac{1}{z} \quad \text{when } |z| \to \infty \tag{5.51}$$

is now sufficient to compute $Q(z)$ and also $\lambda_\pm$ for a given potential $V(z)$. Indeed, expanding
Eq. (5.50) near $z \to \infty$, the coefficients of $Q(z)$ and the values $\lambda_\pm$ must be such as to
cancel the $k + 1$ polynomial coefficients of $V'(z)$ and also ensure that the $1/z$ term has unit
coefficient. This gives $k + 2$ equations to determine the $k$ coefficients of $Q(z)$ and the two
edges $\lambda_\pm$, see next section for an illustration.

Once the polynomial $Q(x)$ is determined, we can read off the eigenvalue density:

$$\rho(\lambda) = \frac{Q(\lambda)\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi} \quad \text{for } \lambda_- \le \lambda \le \lambda_+. \tag{5.52}$$

We see that generically the eigenvalue density behaves as $\rho(\lambda_\pm \mp \delta) \propto \sqrt{\delta}$ near both edges
of the spectrum. If by chance (or by construction) one of the edges is a zero of $Q(z)$, then
the behavior changes to $\delta^\theta$ near that edge, with $\theta = n + \frac{1}{2}$ and $n$ the multiplicity of the root of
$Q(z)$. A potential with generic $\sqrt{\delta}$ behavior at the edge of the density is called non-critical.
Other non-generic cases are called critical. In Section 14.1 we will see how the $\sqrt{\delta}$ edge
singularity is smoothed over a region of width $N^{-2/3}$ for finite $N$. In the critical case, the
smoothed region is of width $N^{-2/(3+2n)}$.
5.3.3 M² + M⁴ Potential


One of the original motivations of Brézin, Itzykson, Parisi and Zuber to study the
ensemble defined by Eq. (5.1) was to count the so-called planar diagrams in some field
theories. To do so they considered the potential

$$V(x) = \frac{x^2}{2} + \frac{\gamma x^4}{4}. \tag{5.53}$$

We will not discuss how one can count planar diagrams from such a potential, but use
this example to illustrate the general recipe given in the main text to compute the Stieltjes
transform and the density of eigenvalues. Interestingly, for a certain value of $\gamma$, the edge
singularity is $\delta^{3/2}$ instead of $\sqrt{\delta}$.

Since the potential is symmetric around zero we expect $\lambda_+ = -\lambda_- =: 2a$. We
introduce this extra factor of 2, so that if $\gamma = 0$, we obtain the semi-circular law with
$a = 1$. Since $V'(z) = z + \gamma z^3$ is a degree three polynomial, we write

$$Q(z) = a_0 + a_1 z + \gamma z^2, \tag{5.54}$$

where the coefficient of $z^2$ was chosen to cancel the $\gamma z^3$ term at infinity. Expanding
Eq. (5.50) near $z \to \infty$ and imposing $g(z) = 1/z + O(1/z^2)$ we get

$$a_1 = 0, \qquad 1 - a_0 + 2\gamma a^2 = 0, \qquad 2a^4\gamma + 2a^2 a_0 = 2. \tag{5.55}$$

Solving for $a_0$, we find

$$g(z) = \frac{z + \gamma z^3 - (1 + 2\gamma a^2 + \gamma z^2)\sqrt[\oplus]{z^2 - 4a^2}}{2}, \tag{5.56}$$

where $a$ is a solution of

$$3\gamma a^4 + a^2 - 1 = 0 \;\Rightarrow\; a^2 = \frac{\sqrt{1 + 12\gamma} - 1}{6\gamma}. \tag{5.57}$$

The density of eigenvalues for the potential (5.53) reads

$$\rho(\lambda) = \frac{(1 + 2\gamma a^2 + \gamma\lambda^2)\sqrt{4a^2 - \lambda^2}}{2\pi} \quad \text{for } \gamma > -\frac{1}{12}, \tag{5.58}$$

with $a$ defined as above. For positive values of $\gamma$, the potential is confining (it is convex
and grows faster than a logarithm for $z \to \pm\infty$). In that case the equation for $a$ always has
a solution, so the Stieltjes transform and the density of eigenvalues are well defined; see
Figure 5.3. For small negative values of $\gamma$, the problem still makes sense. The potential is
convex near zero and the eigenvalues will stay near zero as long as the repulsion does not
push them too far in the non-convex region.


There is a critical value of $\gamma$ at $\gamma_c = -1/12$, which corresponds to $a = \sqrt{2}$. At this
critical point, $g_c(z)$ and the density are given by

$$g_c(z) = \frac{z^3}{24}\left[\left(1 - \frac{8}{z^2}\right)^{3/2} - 1 + \frac{12}{z^2}\right] \quad \text{and} \quad \rho_c(\lambda) = \frac{(8 - \lambda^2)^{3/2}}{24\pi}. \tag{5.59}$$

At this point the density of eigenvalues at the upper edge ($\lambda_+ = 2\sqrt{2}$) behaves as
$\rho(\lambda) \sim (2\sqrt{2} - \lambda)^\theta$ with $\theta = 3/2$ and similarly at the lower edge ($\lambda_- = -2\sqrt{2}$). For values of
$\gamma$ more negative than $\gamma_c$, there are no real solutions for $a$ and Eq. (5.56) ceases to make
sense. In the Coulomb gas analogy, the eigenvalues push each other up to a critical point
after which they run off to infinity. There is no simple argument that gives the location of
the critical point (except for doing the above computation). It is given by a delicate balance
between the repulsion of the eigenvalues and the confining potential. In particular it is not
given by the point $V'(2a) = 0$ as one might naively expect. Note that at the critical point
$V''(2a) = -1$, so we are already outside the convex region.

Figure 5.3 (left) Density of eigenvalues for the potential $V(x) = \frac{1}{2}x^2 + \frac{\gamma}{4}x^4$ for three values of $\gamma$.
For $\gamma = 1$, even if the minimum of the potential is at $\lambda = 0$, the density develops a double hump due
to the repulsion of the eigenvalues. $\gamma = 0$ corresponds to the Wigner case (semi-circle law). Finally
$\gamma = -1/12$ is the critical value of $\gamma$. At this point the density is given by Eq. (5.59). For smaller
values of $\gamma$ the density does not exist. (right) Shape of the potential for the same three values of $\gamma$.
The dots on the bottom curve indicate the edges of the critical spectrum.
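The recipe of this section is easy to check numerically. The sketch below is an illustration under stated assumptions rather than a reference implementation: it solves Eq. (5.57) for $a^2$ (written in the equivalent form $a^2 = 2/(1 + \sqrt{1 + 12\gamma})$, which stays finite at $\gamma = 0$), evaluates the density (5.58) and verifies that it integrates to one. Function names and the integration grid are illustrative choices.

```python
import math

def a_squared(gamma):
    """Positive root of 3*gamma*a^4 + a^2 - 1 = 0, Eq. (5.57), in a form that is finite at gamma = 0."""
    disc = 1.0 + 12.0 * gamma
    if disc < -1e-12:
        raise ValueError("no real solution: gamma is below the critical value -1/12")
    return 2.0 / (1.0 + math.sqrt(max(disc, 0.0)))

def density(lam, gamma):
    """Eq. (5.58) inside the support [-2a, 2a], zero outside."""
    a2 = a_squared(gamma)
    if lam * lam >= 4 * a2:
        return 0.0
    return (1 + 2 * gamma * a2 + gamma * lam * lam) * math.sqrt(4 * a2 - lam * lam) / (2 * math.pi)

def total_mass(gamma, n=2000):
    """Midpoint rule in t with lam = 2a sin(t), which removes the square-root edges."""
    a = math.sqrt(a_squared(gamma))
    h = math.pi / n
    total = 0.0
    for k in range(n):
        t = -math.pi / 2 + (k + 0.5) * h
        total += density(2 * a * math.sin(t), gamma) * 2 * a * math.cos(t) * h
    return total

for gamma in (1.0, 0.0, -1.0 / 12.0):
    print(gamma, a_squared(gamma), total_mass(gamma))
```

For $\gamma = 1$ the same `density` function is larger at $\lambda = 0.5$ than at $\lambda = 0$, which is the double hump described in the caption of Figure 5.3.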


5.4 Fluctuations Around the Most Likely Configuration 
5.4.1 Fluctuations of All Eigenvalues 


The most likely positions of the $N$ eigenvalues are determined by the equations (5.28) that
we derived in Section 5.2.2. These equations balance a Coulomb (repulsive) term and a
confining potential. On distances $d$ of order $1/N$, the Coulomb force on each “charge” is
of order $d^{-1} \sim N$, whereas the confining force $V'$ is of order unity. Therefore, the Coulomb
force is dominant at small scales and the most likely positions must be locally equidistant.
This is expected to be true within small enough intervals

$$I := [\lambda - L/2N,\ \lambda + L/2N], \tag{5.60}$$

where $L$ is sufficiently small, such that the average density $\rho(\lambda)$ does not vary too much,
i.e. $\rho'(\lambda)L/N \ll \rho(\lambda)$.
Of course, some fluctuations around this crystalline order must be expected, and the aim
of this section is to introduce a simple method to characterize these fluctuations. Before
doing so, let us discuss how the number of eigenvalues $n(L)$ in interval $I$ behaves. For
a perfect crystalline arrangement, there are no fluctuations at all and one has
$n(L) = \rho(\lambda)L + O(1)$, where the $O(1)$ term comes from “rounding errors” at the edge of the
interval.

For a Poisson process, where points are placed independently at random with some
density $N\rho(\lambda)$, it is well known that

$$n(L) = \bar{n} + \xi\sqrt{\bar{n}}, \quad \text{with } \bar{n} := \rho(\lambda)L, \tag{5.61}$$

where $\xi$ is a Gaussian random variable $\mathcal{N}(0, 1)$. For the eigenvalues of a large symmetric
random matrix, on the other hand, the exact result is given by

$$n(L) = \bar{n} + \xi\sqrt{\Delta}, \qquad \Delta = \frac{2}{\pi^2}\left[\log(\bar{n}) + C\right] + O(\bar{n}^{-1}), \tag{5.62}$$

where $C$ is a numerical constant.⁵ This result means that the typical fluctuations of the
number of eigenvalues is of order $\sqrt{\log \bar{n}}$ for large $L$, much smaller than the Poisson result
$\sqrt{\bar{n}}$. In fact, the growth of $\Delta$ with $L$ is so slow that one can think of the arrangement of
eigenvalues as “quasi-crystalline”, even after factoring in fluctuations.


Let us see how this $\sqrt{\log \bar{n}}$ behavior of the fluctuations can be captured by computing
the Hessian matrix $H$ of the log-likelihood $\mathcal{L}$ defined in Eq. (5.27). One finds

$$H_{ij} := -\frac{\partial^2 \mathcal{L}}{\partial\lambda_i\,\partial\lambda_j} = \begin{cases} V''(\lambda_i) + \dfrac{2}{N}\displaystyle\sum_{k \neq i} \dfrac{1}{(\lambda_i - \lambda_k)^2} & (i = j), \\[3mm] -\dfrac{2}{N}\,\dfrac{1}{(\lambda_i - \lambda_j)^2} & (i \neq j). \end{cases}$$

This Hessian matrix should be evaluated at the maximum of $\mathcal{L}$, i.e. for the most likely
configurations of the $\lambda_i$. If we call $\epsilon_i/N$ the deviation of $\lambda_i$ away from its most likely
value, and assume all $\epsilon$ to be small enough, their joint distribution can be approximated
by the following multivariate Gaussian distribution:

$$P(\{\epsilon_i\}) \propto \exp\left[-\frac{\beta}{4N}\sum_{i,j}\epsilon_i H_{ij}\epsilon_j\right], \tag{5.63}$$

from which one can obtain the covariance of the deviations as

$$\mathbb{E}[\epsilon_i\epsilon_j] = \frac{2N}{\beta}\,[H^{-1}]_{ij}. \tag{5.64}$$

Since the most likely positions of the $\lambda_i$ are locally equidistant, one can approximate
$\lambda_i - \lambda_j$ as $(i - j)/N\rho$ in the above expression. This is justified when the contribution of
far-away eigenvalues to $H^{-1}$ is negligible in the region of interest, which will indeed be
the case because the off-diagonal elements of $H$ decay fast enough (i.e. as $(i - j)^{-2}$).

⁵ For $\beta = 1$, one finds $C = \log(2\pi) + \gamma + 1 - \frac{\pi^2}{8}$ where $\gamma$ is Euler’s constant, see Mehta [2004].
For simplicity, let us consider the case where $V(x) = x^2/2$, in which case $V''(x) = 1$.
The matrix $H$ is then a Toeplitz matrix, up to unimportant boundary terms. It can be
diagonalized using plane waves (see Appendix A.3), with eigenvalues given by⁶

$$\mu_q = 1 + 4N\rho^2\sum_{\ell=1}^{N-1}\frac{1 - \cos\frac{2\pi q\ell}{N}}{\ell^2}, \qquad q = 0, 1, \ldots, N - 1, \tag{5.65}$$

where $\rho$ is the local average eigenvalue density. In the large $N$ limit, the (convergent) sum
can be replaced by an integral:

$$\mu_q \approx 1 + 4N\rho^2 \times \frac{2\pi|q|}{N}\int_0^\infty du\,\frac{1 - \cos u}{u^2} = 1 + 4\pi^2\rho^2|q|. \tag{5.66}$$


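The eigenvalues of $H^{-1}$ are then given by $1/\mu_q$ and the covariance of the deviations for
$i - j = n$ is obtained from Eq. (5.64) as an inverse Fourier transform. After transforming
$q \to uN/2\pi$ this reads

$$\mathbb{E}[\epsilon_i\epsilon_j] = \frac{2N}{\beta}\,\frac{1}{2\pi}\int_{-\pi}^{\pi} du\,\frac{e^{iun}}{1 + 2\pi N\rho^2|u|}. \tag{5.67}$$

Now, the fluctuating distance $d_{ij}$ between eigenvalues $i$ and $j = i + n$ is, by definition of
the $\epsilon$,

$$d_{ij} = \frac{n}{N\rho} + \frac{\epsilon_i - \epsilon_j}{N}. \tag{5.68}$$

Its variance is obtained, for $N \to \infty$, from

$$\mathbb{E}[(\epsilon_i - \epsilon_j)^2] \approx \frac{2}{\beta\pi^2\rho^2}\int_0^{\pi} du\,\frac{1 - \cos un}{u} \approx \frac{2}{\beta\pi^2\rho^2}\log n. \tag{5.69}$$
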
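The logarithm in Eq. (5.69) can be made explicit with a change of variable $t = un$ (a short worked step, where $\gamma_E$ denotes Euler's constant and $\mathrm{Ci}$ the cosine integral, both standard but not used elsewhere in this section):

$$\int_0^{\pi} du\,\frac{1 - \cos un}{u} = \int_0^{n\pi} dt\,\frac{1 - \cos t}{t} = \log(n\pi) + \gamma_E - \mathrm{Ci}(n\pi) = \log n + O(1),$$

since $\mathrm{Ci}(x) \to 0$ for large $x$. Only the $\log n$ term is controlled by this argument; the $O(1)$ constant depends on how the integral is cut off.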


The variables $\epsilon$ are thus long-ranged, log-correlated Gaussian variables. Interestingly,
there has been a flurry of activity concerning this problem in recent years (see references
at the end of this chapter).

Finally, the fluctuating local density of eigenvalues can be computed from the number
of eigenvalues within a distance $d_{ij}$, i.e.

$$\frac{1}{N}\,\frac{n}{d_{ij}} \approx \rho - \rho^2\,\frac{\epsilon_i - \epsilon_j}{n} \qquad (n \gg 1). \tag{5.70}$$

Its variance is thus given by

$$\mathbb{V}[\rho] = \frac{2\rho^2}{\beta\pi^2 n^2}\log n. \tag{5.71}$$

Hence, the variance of the number of eigenvalues in a fixed interval of size $L/N$ and
containing on average $\bar{n} = \rho L$ eigenvalues is

$$\mathbb{V}[\rho L] = L^2\,\mathbb{V}[\rho] = \frac{2}{\beta\pi^2}\log\bar{n}. \tag{5.72}$$

This argument recovers the leading term of the exact result for all values of $\beta$ (compare
with Eq. (5.62) for $\beta = 1$).


⁶ In fact, the Hessian $H$ can be diagonalized exactly in this case, without any approximations, see Agarwal et al. [2019] and
references therein.
5.4.2 Large Deviations of the Top Eigenvalue


Another interesting question concerns the fluctuations of the top eigenvalue of a matrix
drawn from a $\beta$-ensemble. Consider Eq. (5.26) with $\lambda_{\max}$ isolated:

$$P(\lambda_{\max}; \{\lambda_i\}) = P_{N-1}(\{\lambda_i\})\exp\left\{-\frac{N\beta}{2}\left[V(\lambda_{\max}) - \frac{2}{N}\sum_{i=1}^{N-1}\log(\lambda_{\max} - \lambda_i)\right]\right\}, \tag{5.73}$$

with

$$P_{N-1}(\{\lambda_i\}) := Z_N^{-1}\exp\left\{-\frac{\beta}{2}\left[N\sum_{i=1}^{N-1}V(\lambda_i) - \sum_{\substack{i,j=1 \\ j \neq i}}^{N-1}\log|\lambda_i - \lambda_j|\right]\right\}. \tag{5.74}$$

At large $N$, this probability is dominated by the most likely configuration, which is
determined by Eq. (5.27) with $\lambda_{\max}$ removed. Clearly, the most likely positions $\{\lambda_i^*\}$ of the $N - 1$
other eigenvalues are only changed by an amount $O(N^{-1})$, but since the log-likelihood is
close to its maximum, the change of $\mathcal{L}$ (i.e. the quantity in the exponential) is only of order
$N^{-2}$ and we will neglect it. Then one has the following large deviation expression:


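$$\frac{P(\lambda_{\max}; \{\lambda_i^*\})}{P(\lambda_+; \{\lambda_i^*\})} := \exp\left[-\frac{N\beta}{2}\,\Phi(\lambda_{\max})\right], \tag{5.75}$$

with

$$\Phi(x) = V(x) - V(\lambda_+) - \frac{2}{N}\sum_{i=1}^{N-1}\log(x - \lambda_i^*) + \frac{2}{N}\sum_{i=1}^{N-1}\log(\lambda_+ - \lambda_i^*), \tag{5.76}$$

where $\lambda_+$ is the top edge of the spectrum. Note that $\Phi(\lambda_+) = 0$ by construction. To deal
with the large $N$ limit of this expression, we take the derivative of $\Phi(x)$ with respect to $x$
to find

$$\Phi'(x) = V'(x) - \frac{2}{N}\sum_{i=1}^{N-1}\frac{1}{x - \lambda_i^*} \;\xrightarrow[N \to \infty]{}\; V'(x) - 2g(x). \tag{5.77}$$

Hence,

$$\Phi(x) = \int_{\lambda_+}^{x}\left(V'(s) - 2g(s)\right)ds. \tag{5.78}$$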


When $V'(x)$ is a polynomial of degree $k \ge 1$, we can use Eq. (5.37),
yielding

$$\Phi(x) = \int_{\lambda_+}^{x}\sqrt{V'(s)^2 - 4\Pi(s)}\,ds, \tag{5.79}$$

where $\lambda_+$ is the largest zero of the polynomial $V'(s)^2 - 4\Pi(s)$. Since $\Pi(s)$ is a polynomial
of order $k - 1$, for large $s$ one always has

$$V'(s)^2 \gg |\Pi(s)| \tag{5.80}$$

and therefore

$$\Phi(x) \approx V(x). \tag{5.81}$$


As expected, the probability to observe a top eigenvalue very far from the bulk is dominated
by the potential term, and the Coulomb interaction no longer plays any role. When $x - \lambda_+$
is small but still much larger than any inverse power of $N$, we have

$$\sqrt{V'(s)^2 - 4\Pi(s)} \approx 2C(s - \lambda_+)^\theta, \tag{5.82}$$

where $C$ is a constant and $\theta$ depends on the type of root of $V'(s)^2 - 4\Pi(s)$ at $s = \lambda_+$. For
a single root, which is the typical case, $\theta = 1/2$. Performing the integral we get

$$\Phi(\lambda_{\max}) = \frac{2C}{\theta + 1}(\lambda_{\max} - \lambda_+)^{\theta + 1}, \qquad \lambda_{\max} - \lambda_+ \ll 1. \tag{5.83}$$

Note that the constant $C$ can be determined from the eigenvalue density near (but slightly
below) $\lambda_+$, to wit, $\pi\rho(\lambda) \approx C(\lambda_+ - \lambda)^\theta$. The factor 2 in Eq. (5.82) follows from
$\rho(\lambda) = \sqrt{4\Pi(\lambda) - V'(\lambda)^2}/2\pi$ inside the support.

For a generic edge with $\theta = 1/2$, one finds that $\Phi(\lambda_{\max}) \approx \frac{4C}{3}(\lambda_{\max} - \lambda_+)^{3/2}$, and thus
the probability to find $\lambda_{\max}$ just above $\lambda_+$ decays as

$$P(\lambda_{\max}) \sim \exp\left[-\frac{2\beta C}{3}u^{3/2}\right]; \qquad u = N^{2/3}(\lambda_{\max} - \lambda_+), \tag{5.84}$$

where we have introduced the rescaled variable $u$ in order to show that this probability
decays on scale $N^{-2/3}$. We will further discuss this result in Section 14.1, where we will
see that the $u^{3/2}$ behavior coincides with the right tail of the Tracy-Widom distribution.
For a unit Wigner, we have (see Fig. 5.4)

$$\Phi(x) = \frac{1}{2}x\sqrt{x^2 - 4} + 2\log\left(\frac{x - \sqrt{x^2 - 4}}{2}\right) \quad \text{for } x > 2. \tag{5.85}$$

For a Wishart matrix with parameter $q = 1$,

$$\Phi(x) = \sqrt{(x - 4)x} + 2\log\left(\frac{x - \sqrt{(x - 4)x} - 2}{2}\right) \quad \text{for } x > 4. \tag{5.86}$$

Figure 5.4 Large deviation function for the top eigenvalue $\Phi(x)$ for a unit Wigner and a Wishart
matrix with $\lambda_+ = 2$ ($q = 3 - 2\sqrt{2}$). The Wishart curve was obtained by numerically integrating
Eq. (5.78).
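The closed forms (5.85) and (5.86) can be compared with a direct numerical integration of Eq. (5.78). The following is a minimal sketch, assuming $\Phi'(s) = \sqrt{s^2 - 4}$ for the unit Wigner case and $\Phi'(s) = \sqrt{(s - 4)/s}$ for the Wishart case with $q = 1$ (both follow from $V'(s) - 2g(s)$). The substitution $s = \lambda_+ + t^2$ and the use of Simpson's rule are implementation choices, not something prescribed by the text.

```python
import math

def phi_wigner(x):
    """Closed form (5.85), valid for x >= 2."""
    r = math.sqrt(x * x - 4)
    return 0.5 * x * r + 2 * math.log((x - r) / 2)

def phi_wishart_q1(x):
    """Closed form (5.86), valid for x >= 4."""
    r = math.sqrt((x - 4) * x)
    return r + 2 * math.log((x - r - 2) / 2)

def integrate_from_edge(integrand_in_t, edge, x, n=2000):
    """Simpson's rule for the integral from edge to x after the substitution s = edge + t^2."""
    t_max = math.sqrt(x - edge)
    h = t_max / n
    total = integrand_in_t(0.0) + integrand_in_t(t_max)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand_in_t(k * h)
    return total * h / 3

# Integrands already include the Jacobian ds = 2 t dt.
wigner_num = integrate_from_edge(lambda t: 2 * t * t * math.sqrt(4 + t * t), 2.0, 3.0)
wishart_num = integrate_from_edge(lambda t: 2 * t * t / math.sqrt(4 + t * t), 4.0, 6.0)
print(wigner_num, phi_wigner(3.0))
print(wishart_num, phi_wishart_q1(6.0))
```

Close to the edge the Wigner expression behaves as $\frac{4}{3}(x - 2)^{3/2}$, which can be checked with the same functions.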


Remember that the Stieltjes transform $g(z)$ and the density of eigenvalues $\rho(\lambda)$ are only
sensitive to the potential function $V(x)$ for values in the support of $\rho$ ($\lambda_- \le x \le \lambda_+$).
The large deviation function $\Phi(x)$, on the other hand, depends on the value of the potential
for $x > \lambda_+$. For Eq. (5.79) to hold, the same potential function must extend analytically
outside the support of $\rho$. In Section 5.3.3, we considered a non-confining potential (when
$\gamma < 0$). We argued that we could compute $g(z)$ and $\rho(\lambda)$ as if the potential were confining
as long as we could replace the potential outside $(\lambda_-, \lambda_+)$ by a convex function going
to infinity. In that case, the function $\Phi(x)$ depends on the choice of regularization of the
potential. Computing Eq. (5.78) with the non-confining potential would give nonsensical
results.


Exercise 5.4.1 Large deviations for Wigner and Wishart 


(a) Show that Eqs. (5.85) and (5.86) are indeed the large deviation function for
the top eigenvalues of a unit Wigner and a Wishart $q = 1$. To do so, show
that they satisfy Eq. (5.77) and that $\Phi(\lambda_+) = 0$ with $\lambda_+ = 2$ for Wigner and
$\lambda_+ = 4$ for Wishart $q = 1$.

(b) Find the dominant contribution of both $\Phi(x)$ for large $x$. Compare to their
respective $V(x)$.

(c) Expand the two expressions for $\Phi(x)$ near their respective $\lambda_+$ and show that
they have the correct leading behavior. What are the exponent $\theta$ and constant
$C$?


5.5 An Eigenvalue Density Saddle Point 


As an alternative approach to the most likely configuration of eigenvalues, one often
finds in the random matrix literature the following density formalism. One first introduces
the density of “charges” $\omega(x)$ for a given set of eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_N$ (not necessarily
the most likely configuration), as

$$\omega(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(\lambda_i - x). \tag{5.87}$$

Expressed in terms of this density field, the joint distribution of eigenvalues, Eq. (5.26),
can be written as

$$P(\{\omega\}) = Z^{-1}\exp\left\{-\frac{\beta N^2}{2}\left[\int dx\,\omega(x)V(x) - \mathrm{P}\!\!\int dx\,dy\,\omega(x)\omega(y)\log|x - y|\right] - N\int dx\,\omega(x)\log\omega(x)\right\}, \tag{5.88}$$

where the last term is an entropy term, formally corresponding to the change of variables
from the $\{\lambda_i\}$ to $\omega(x)$. Since this term is of order $N$, compared to the two first terms that
are of order $N^2$, it is usually neglected.⁷

One then proceeds by looking for the density field that maximizes the term in the
exponential, which is obtained by taking its functional derivative with respect to all $\omega(x)$:

$$\frac{\delta}{\delta\omega(x)}\left[\int dy\,\omega(y)V(y) - \mathrm{P}\!\!\int dy\,dy'\,\omega(y)\omega(y')\log|y - y'| - \zeta\int dy\,\omega(y)\right]_{\omega^*} = 0, \tag{5.89}$$

where $\zeta$ is a Lagrange multiplier, used to impose the normalization condition
$\int dx\,\omega(x) = 1$. This leads to

$$V(x) = 2\,\mathrm{P}\!\!\int dy\,\omega^*(y)\log|x - y| + \zeta. \tag{5.90}$$

We can now take the derivative with respect to $x$ to get

$$V'(x) = 2\,\mathrm{P}\!\!\int dy\,\frac{\omega^*(y)}{x - y}, \tag{5.91}$$

which is nothing but the continuum limit version of Eq. (5.28), and is identical to
Eq. (5.38). Although this whole procedure looks somewhat ad hoc, it can be fully
justified mathematically. In the mathematical literature, it is known as the large deviation
formalism.

Equation (5.91) is a singular integral equation for $\omega(x)$ of the so-called Tricomi type.
Such equations often have explicit solutions, see Appendix A.2. In the case where
$V(x) = x^2/2$, one recovers, as expected, the semi-circle law:

$$\omega^*(x) = \frac{1}{2\pi}\sqrt{4 - x^2}. \tag{5.92}$$

One interesting application of the density formalism is to investigate the case of Gaussian
orthogonal random matrices conditioned to have all eigenvalues strictly positive.
What is the probability for this to occur spontaneously? In such a case, what is the resulting
distribution of eigenvalues?

The trick is to solve Eq. (5.91) with the constraint that $\omega(x < 0) = 0$. This leads to

$$x = 2\,\mathrm{P}\!\!\int_0^\infty dy\,\frac{\omega^*(y)}{x - y}. \tag{5.93}$$

The solution to this truncated problem can also be found analytically using the general
result of Appendix A.2. One finds

$$\omega^*(x) = \frac{1}{4\pi}\sqrt{\frac{\lambda_+ - x}{x}}\,(\lambda_+ + 2x), \qquad \text{with } \lambda_+ = \frac{4}{\sqrt{3}}. \tag{5.94}$$

The resulting density has an inverse square-root divergence close to zero, which is a stigma of all
the negative eigenvalues being pushed to the positive side. The right edge itself is pushed
from 2 to $4/\sqrt{3}$, see Figure 5.5.
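A quick consistency check of Eq. (5.94) is that the density integrates to one only for the quoted edge $\lambda_+ = 4/\sqrt{3}$. The sketch below illustrates this; the substitution $x = \lambda_+\sin^2\theta$ is an implementation choice that absorbs the $x^{-1/2}$ divergence, and the function names are illustrative.

```python
import math

LAMBDA_PLUS = 4 / math.sqrt(3)   # right edge quoted with Eq. (5.94)

def conditioned_density(x):
    """Eq. (5.94) on (0, lambda_+), zero elsewhere."""
    if x <= 0 or x >= LAMBDA_PLUS:
        return 0.0
    return math.sqrt((LAMBDA_PLUS - x) / x) * (LAMBDA_PLUS + 2 * x) / (4 * math.pi)

def total_mass(n=2000):
    """Midpoint rule in theta with x = lambda_+ sin^2(theta)."""
    h = (math.pi / 2) / n
    total = 0.0
    for k in range(n):
        theta = (k + 0.5) * h
        x = LAMBDA_PLUS * math.sin(theta) ** 2
        jacobian = 2 * LAMBDA_PLUS * math.sin(theta) * math.cos(theta)
        total += conditioned_density(x) * jacobian * h
    return total

print(total_mass())   # should be close to 1
```

Analytically the integral equals $3\lambda_+^2/16$, which is indeed 1 for $\lambda_+^2 = 16/3$.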


⁷ Note however that when $\beta = c/N$, as considered in Allez et al. [2012], the entropy term must be retained.
Figure 5.5 Density of eigenvalues for a Wigner matrix conditioned to be positive semi-definite. The
positive part of the density of a standard Wigner is shown for comparison.


Injecting this solution back into Eq. (5.88) and comparing with the result corresponding
to the standard semi-circle allows one to compute the probability for such an exceptional
configuration to occur. After a few manipulations, one finds that this probability
is given by $e^{-\beta C N^2}$ with $C = \log 3/4 \approx 0.2746\ldots$. The probability that a Gaussian
random matrix has by chance all its eigenvalues positive therefore decreases extremely fast
with $N$.

Other constraints are possible as well, for example if one chooses to confine all eigenvalues
in a certain interval $[\ell_-, \ell_+]$ with $\ell_- > \lambda_-$ or $\ell_+ < \lambda_+$. (In the cases where
$\ell_- < \lambda_-$ and $\ell_+ > \lambda_+$ the confinement plays no role.) Let us study this problem in
the case where there is no external potential at all, i.e. when $V(x) = C$, where $C$ is an
arbitrary constant, but only confining walls. In this case, the general solution of Appendix
A.2 immediately leads to

$$\omega^*(x) = \frac{1}{\pi\sqrt{(x - \ell_-)(\ell_+ - x)}}, \tag{5.95}$$

which has an inverse square-root divergence at both edges. This law is called the arcsine law, which
appears in different contexts, see Sections 7.2 and 15.3.1.

Note that the minimization of the quadratic form $\int dx\,dy\,\omega(x)\omega(y)G(|x - y|)$ for a
general power-law interaction $G(u) = u^{-\gamma}$, subject to the constraint $\int dx\,\omega(x) = 1$, has
been solved in the very different, financial engineering context of optimal execution with
quadratic costs. The solution in that case reads

$$\omega^*(x) = A(\gamma)\,\ell^{-\gamma}\sqrt{(x - \ell_-)^{\gamma - 1}(\ell_+ - x)^{\gamma - 1}}, \tag{5.96}$$

with $\ell := \ell_+ - \ell_-$ and $A(\gamma) := \gamma\Gamma[\gamma]/\Gamma^2[(1 + \gamma)/2]$. The case $G(u) = -\log(u)$
formally corresponds to $\gamma = 0$, in which case one recovers Eq. (5.95).
Eigenvalues and Orthogonal Polynomials*


In this chapter, we investigate yet another route to shed light on the eigenvalue density
of the Wigner and Wishart ensembles. We show (a) that the most probable positions of
the Coulomb gas problem coincide with the zeros of Hermite polynomials in the Wigner
case, and of Laguerre polynomials in the Wishart case; and (b) that the average (over
randomness) of the characteristic polynomials (defined as $\det(z\mathbf{1} - \mathbf{X}_N)$) of Wigner or
Wishart random matrices of size $N$ obey simple recursion relations that allow one to express
them as, respectively, Hermite and Laguerre polynomials. The fact that the two methods
lead to the same result (at least for large $N$) reflects the fact that eigenvalues fluctuate
very little around their most probable positions. Finally we show that for unitary ensembles
$\beta = 2$, the expected characteristic polynomial is always an orthogonal polynomial with
respect to some weight function related to the matrix potential.


6.1 Wigner Matrices and Hermite Polynomials 
6.1.1 Most Likely Eigenvalues and Zeros of Hermite Polynomials 


In the previous chapter, we established a general equation for the Stieltjes transform of
the most likely positions of the eigenvalues of random matrices belonging to a general
orthogonal ensemble, see Eq. (5.34). In the special case of a quadratic potential $V(x) = x^2/2$,
this equation reads

$$z\,g_N(z) - 1 = g_N^2(z) + \frac{g_N'(z)}{N}. \tag{6.1}$$

This ordinary differential equation is of the Riccati type,¹ and can be solved by setting
$g_N(z) := \psi'(z)/(N\psi(z))$. This yields, upon substitution,

$$\psi''(z) - Nz\,\psi'(z) + N^2\psi(z) = 0, \tag{6.2}$$

or, with $\psi(z) = \Psi(x = \sqrt{N}z)$,

$$\Psi''(x) - x\,\Psi'(x) + N\,\Psi(x) = 0. \tag{6.3}$$


1 A Riccati equation is a first order differential equation that is quadratic in the unknown function, in our case $g_N(z)$.

The solution of this last equation with the correct behavior for large $x$ is the Hermite
polynomial of order $N$. General Hermite polynomials $H_n(x)$ are defined as the $n$th order
polynomial that starts as $x^n$ and is orthogonal to all previous ones under the unit Gaussian
measure:²

$$\int \frac{dx}{\sqrt{2\pi}}\, H_n(x)H_m(x)\,e^{-x^2/2} = 0 \quad \text{when } n \neq m. \tag{6.4}$$

The first few are given by

$$\begin{aligned}
H_0(x) &= 1,\\
H_1(x) &= x,\\
H_2(x) &= x^2 - 1,\\
H_3(x) &= x^3 - 3x,\\
H_4(x) &= x^4 - 6x^2 + 3.
\end{aligned} \tag{6.5}$$

In addition to the above ODE (6.3), they satisfy

$$\frac{d}{dx}H_n(x) = n\,H_{n-1}(x) \tag{6.6}$$

and the recursion

$$H_n(x) = x\,H_{n-1}(x) - (n-1)\,H_{n-2}(x), \tag{6.7}$$

which, combined together, recover Eq. (6.3). Hermite polynomials can be written explicitly as

$$H_n(x) = n! \sum_{m=0}^{\lfloor n/2 \rfloor} \frac{(-1)^m}{m!\,(n-2m)!}\,\frac{x^{n-2m}}{2^m}, \tag{6.8}$$

where $\lfloor n/2 \rfloor$ is the integer part of $n/2$.
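The identities above are easy to check with exact integer arithmetic. The following is a minimal sketch (the helper names `hermite`, `hermite_explicit` and `derivative` are ours, not the book's): it builds the probabilists' Hermite polynomials from the three-term recursion and compares them with the closed-form sum and with the derivative rule $H_n' = nH_{n-1}$.

```python
from math import factorial

def hermite(n):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial H_n,
    built from the recursion H_k = x H_{k-1} - (k-1) H_{k-2}."""
    prev, cur = [1], [0, 1]              # H_0 = 1, H_1 = x
    if n == 0:
        return prev
    for k in range(2, n + 1):
        nxt = [0] + cur                  # multiply H_{k-1} by x
        for i, c in enumerate(prev):
            nxt[i] -= (k - 1) * c        # subtract (k-1) H_{k-2}
        prev, cur = cur, nxt
    return cur

def hermite_explicit(n):
    """Same coefficients from the closed-form sum over m = 0 .. floor(n/2)."""
    coeffs = [0] * (n + 1)
    for m in range(n // 2 + 1):
        # n! / (m! (n-2m)! 2^m) counts the ways to pick m disjoint pairs: an integer
        coeffs[n - 2 * m] = (-1) ** m * (
            factorial(n) // (factorial(m) * factorial(n - 2 * m) * 2 ** m)
        )
    return coeffs

def derivative(p):
    """Coefficients of dp/dx."""
    return [i * c for i, c in enumerate(p)][1:] or [0]

print(hermite(4))                        # coefficients of x^4 - 6 x^2 + 3
```

Working with coefficient lists keeps everything exact, so the comparison between the recursion and the explicit sum is an equality of integers rather than a floating-point test.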
Coming back to Eq. (6.1), we thus conclude that the exact solution for the Stieltjes
transform $g_N(z)$ at finite $N$ is

$$g_N(z) = \frac{H_N'(\sqrt{N}z)}{\sqrt{N}\,H_N(\sqrt{N}z)}, \tag{6.9}$$

or, writing $H_N(x) = \prod_{i=1}^N (x - \sqrt{N}h_i)$, where $\sqrt{N}h_i$ are the $N$ (real) zeros of $H_N(x)$,

$$g_N(z) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{z - h_i}. \tag{6.10}$$
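As a numerical sanity check, one can plug the Hermite expression for $g_N(z)$ into the finite-$N$ Riccati equation $z g_N - 1 = g_N^2 + g_N'/N$ and verify that the residual vanishes to machine precision. This is only a sketch; the function names and the test point $z = 3$ (chosen outside the support of the spectrum) are our own choices.

```python
from math import sqrt

def hermite(n):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial H_n."""
    prev, cur = [1], [0, 1]
    if n == 0:
        return prev
    for k in range(2, n + 1):
        nxt = [0] + cur
        for i, c in enumerate(prev):
            nxt[i] -= (k - 1) * c
        prev, cur = cur, nxt
    return cur

def derivative(p):
    return [i * c for i, c in enumerate(p)][1:] or [0]

def polyval(p, x):
    """Horner evaluation of a polynomial given lowest degree first."""
    acc = 0.0
    for c in reversed(p):
        acc = acc * x + c
    return acc

def riccati_residual(N, z):
    """z g - 1 - (g^2 + g'/N) for g(z) = H_N'(sqrt(N) z) / (sqrt(N) H_N(sqrt(N) z))."""
    x = sqrt(N) * z
    H = hermite(N)
    dH = derivative(H)
    d2H = derivative(dH)
    h, h1, h2 = polyval(H, x), polyval(dH, x), polyval(d2H, x)
    g = h1 / (sqrt(N) * h)
    # d/dz of g: the chain-rule factor sqrt(N) cancels the 1/sqrt(N) prefactor
    dg = (h2 * h - h1 ** 2) / h ** 2
    return z * g - 1.0 - (g ** 2 + dg / N)

print(riccati_residual(8, 3.0))   # should be zero up to rounding
```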


2 Hermite polynomials can be defined using two different conventions for the unit Gaussian measure. We use here the
“probabilists’ Hermite polynomials”, while the “physicists’ convention” uses a Gaussian weight proportional to $e^{-x^2}$.

Figure 6.1 Histogram of the 64 zeros of $H_{64}(x)$. The full line is the asymptotic prediction from the
semi-circle law.


Comparing with the definition of $g_N(z)$, Eq. (5.30), one concludes that the most likely
positions of the Coulomb particles are exactly given by the zeros of Hermite polynomials
(scaled by $\sqrt{N}$). This is a rather remarkable result, which holds for more general
confining potentials, to which are associated different kinds of orthogonal polynomials.
Explicit examples will be given later in this section for the Wishart ensemble, where we
will encounter Laguerre polynomials (see also Chapter 7 for Jacobi polynomials).

Since we know that $g_N(z)$ converges, for large $N$, towards the Stieltjes transform of
the semi-circle law, we can conclude that the rescaled zeros of Hermite polynomials are
themselves distributed, for large $N$, according to the same semi-circle law. This classical
property of Hermite polynomials is illustrated in Figure 6.1.


6.1.2 Expected Characteristic Polynomial of Wigner Matrices 


In this section, we will show that the expected characteristic polynomial of a Wigner matrix
$\mathbf{X}_N$, defined as $Q_N(z) := \mathbb{E}[\det(z\mathbf{1} - \mathbf{X}_N)]$, is given by the same Hermite polynomial as
above. The idea is to write a recursion relation for $Q_N(z)$ by expanding the determinant in
minors. Since we will be comparing Wigner matrices of different sizes, it is more convenient
to work with unscaled matrices $\mathbf{Y}_N = \sqrt{N}\,\mathbf{X}_N$, i.e. symmetric matrices of size $N$ with
elements of zero mean and variance 1 (it will turn out that the variance of diagonal elements
is actually irrelevant, and so can also be chosen to be 1). We define

$$q_N(z) := \mathbb{E}[\det(z\mathbf{1} - \mathbf{Y}_N)]. \tag{6.11}$$
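Using $\det(\alpha\mathbf{A}) = \alpha^N \det(\mathbf{A})$, we then have

$$Q_N(z) = N^{-N/2}\, q_N(\sqrt{N}z). \tag{6.12}$$

We can compute the first two $q_N(z)$ by hand:

$$q_1(z) = \mathbb{E}[z - Y_{11}] = z; \qquad q_2(z) = \mathbb{E}\left[(z - Y_{11})(z - Y_{22}) - Y_{12}^2\right] = z^2 - 1. \tag{6.13}$$

To compute the polynomials for $N \geq 3$ we first expand the determinant in minors from the
first row. We call $M_{i,j}$ the $ij$-minor, i.e. the determinant of the submatrix of $z\mathbf{1} - \mathbf{Y}_N$ with
row $i$ and column $j$ removed:

$$\begin{aligned}
\det(z\mathbf{1} - \mathbf{Y}_N) &= \sum_{i=1}^{N} (-1)^{i+1}\,(z\delta_{i1} - Y_{1i})\,M_{1,i}\\
&= z\,M_{1,1} - Y_{11}M_{1,1} + \sum_{i=2}^{N} (-1)^{i}\, Y_{1i}\,M_{1,i}.
\end{aligned} \tag{6.14}$$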
We would like to take the expectation value of this last expression. The first two terms
are easy: the minor $M_{1,1}$ is the same determinant with a Wigner matrix of size $N - 1$, so
$\mathbb{E}[M_{1,1}] = q_{N-1}(z)$; the diagonal element $Y_{11}$ is independent from the rest of the matrix
and its expectation is zero.

For the other terms in the sum, the minor $M_{1,i}$ is not independent of $Y_{1i}$. Indeed, because
$\mathbf{Y}_N$ is symmetric, the corresponding submatrix contains another copy of $Y_{1i}$, namely $Y_{i1}$. Let us then
expand $M_{1,i}$ itself on the $i$th row, to make the other term $Y_{i1}$ appear explicitly. For $i \neq 1$,
we have

$$\begin{aligned}
M_{1,i} &= \sum_{j=1,\, j\neq i}^{N} (-1)^{i+j+\Theta(j-i)}\, Y_{ij}\, M_{1i,ij}\\
&= (-1)^{i+1}\, Y_{i1}\, M_{1i,i1} + \sum_{j=2,\, j\neq i}^{N} (-1)^{i+j+\Theta(j-i)}\, Y_{ij}\, M_{1i,ij},
\end{aligned} \tag{6.15}$$

where $\Theta(j-i)$ equals 1 for $j > i$ and 0 otherwise (it accounts for the removed column $i$), and
$M_{ij,kl}$ is the “sub-minor”, with rows $i, j$ and columns $k, l$ removed.

We can now take the expectation value of Eq. (6.14) by noting that $Y_{1i}$ is independent
of all the terms in Eq. (6.15) except the first one. We also realize that $M_{1i,i1}$ is the same
determinant with a Wigner matrix of size $N - 2$ that is now independent of $Y_{1i}$, so we have
$\mathbb{E}[M_{1i,i1}] = q_{N-2}(z)$. Putting everything together we get

$$q_N(z) := \mathbb{E}[\det(z\mathbf{1} - \mathbf{Y}_N)] = z\,q_{N-1}(z) - (N-1)\,q_{N-2}(z). \tag{6.16}$$


We recognize here precisely the recursion relation (6.7) that defines Hermite polynomials. 
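Since the derivation only uses independence, zero mean and unit variance of the entries, the result can be tested exactly on a toy distribution. The sketch below is our own illustration, not the book's: it assumes entries equal to $\pm 1$ with probability 1/2 (an assumption made purely so that the average becomes a finite sum), enumerates all symmetric sign matrices of size $N \le 4$, and compares the averaged determinant with $H_N(z)$ at an integer $z$.

```python
from itertools import product

def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def hermite_value(n, z):
    """H_n(z) from the recursion H_n = z H_{n-1} - (n-1) H_{n-2}."""
    prev, cur = 1, z
    if n == 0:
        return prev
    for k in range(2, n + 1):
        prev, cur = cur, z * cur - (k - 1) * prev
    return cur

def summed_charpoly(N, z):
    """Sum of det(z 1 - Y) over all symmetric N x N matrices Y with entries +1 or -1.
    Returns (sum, number of matrices); their ratio is the exact average."""
    pairs = [(i, j) for i in range(N) for j in range(i, N)]
    total = 0
    for signs in product((-1, 1), repeat=len(pairs)):
        A = [[z if i == j else 0 for j in range(N)] for i in range(N)]
        for (i, j), s in zip(pairs, signs):
            A[i][j] -= s
            if i != j:
                A[j][i] -= s          # symmetric copy of the same entry
        total += det(A)
    return total, 2 ** len(pairs)

for N in (2, 3, 4):
    total, count = summed_charpoly(N, 3)
    print(N, total // count, hermite_value(N, 3))
```

The brute-force enumeration scales as $2^{N(N+1)/2}$, so it is only meant as a check for very small $N$.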

How should this result be interpreted? Suppose for one moment that the positions of
the eigenvalues $\lambda_i$ of $\mathbf{Y}_N$ were not fluctuating from sample to sample, and fixed to their
most likely values $\lambda_i^*$. In this case, the expectation operator would not be needed and one
would have

$$g_N(z) = \frac{1}{N}\frac{d}{dz}\log Q_N(z) = \frac{H_N'(\sqrt{N}z)}{\sqrt{N}\,H_N(\sqrt{N}z)}, \tag{6.17}$$

recovering the result of the previous section. What is somewhat surprising is that

$$\mathbb{E}\left[\prod_{i=1}^{N}(z - \lambda_i)\right] = \prod_{i=1}^{N}(z - \lambda_i^*) \tag{6.18}$$

even when fluctuations are accounted for. In particular, in the limit $N \to \infty$, the average
Stieltjes transform should be computed from the average of the logarithm of the characteristic
polynomial:

$$g(z) = \lim_{N\to\infty} \frac{1}{N}\,\mathbb{E}\left[\frac{d}{dz}\log\det(z\mathbf{1} - \mathbf{X}_N)\right], \tag{6.19}$$

but the above calculation shows that one can compute the logarithm of the average characteristic
polynomial instead. The deep underlying mechanism is that the eigenvalue spectrum of
random matrices is rigid: fluctuations around the most probable positions are small.


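Exercise 6.1.1 Hermite polynomials and moments of the Wigner
Show that (for $N \geq 4$)

$$Q_N(x) = x^N - \frac{N-1}{2}\,x^{N-2} + \frac{(N-1)(N-2)(N-3)}{8N}\,x^{N-4} + O\!\left(x^{N-6}\right); \tag{6.20}$$

therefore

$$g_N(z) = \frac{1}{z} + \frac{N-1}{N}\frac{1}{z^3} + \frac{(N-1)(2N-3)}{N^2}\frac{1}{z^5} + O\!\left(\frac{1}{z^7}\right), \tag{6.21}$$

so in the large $N$ limit we recover the first few terms of the Wigner Stieltjes
transform

$$g(z) = \frac{1}{z} + \frac{1}{z^3} + \frac{2}{z^5} + O\!\left(\frac{1}{z^7}\right). \tag{6.22}$$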


6.2 Laguerre Polynomials 
6.2.1 Most Likely Characteristic Polynomial of Wishart Matrices 


Similarly to the case of Wigner matrices, the Stieltjes transform of the most likely positions
of the Coulomb charges in the Wishart ensemble can be written as $g_N(z) := \psi'(z)/(N\psi(z))$,
where $\psi(z)$ is a monic polynomial of degree $N$ satisfying

$$\psi''(x) - N V'(x)\,\psi'(x) + N^2\,\Pi_N(x)\,\psi(x) = 0, \tag{6.23}$$

with

$$N V'(x) = \frac{N - T - 1 + 2\beta^{-1}}{x} + T, \tag{6.24}$$

and, using Eq. (5.32),

$$\Pi_N(x) = \frac{1}{N}\sum_{k=1}^{N} \frac{V'(x) - V'(\lambda_k^*)}{x - \lambda_k^*} = \frac{c_N}{x}, \tag{6.25}$$

where

$$c_N = -\frac{N - T - 1 + 2\beta^{-1}}{N}\;\frac{1}{N}\sum_{k=1}^{N}\frac{1}{\lambda_k^*}. \tag{6.26}$$

Writing now $\psi(x) = \Psi(Tx)$ and $u = Tx$, Eq. (6.23) becomes

$$u\,\Psi''(u) - \left(N - T - 1 + 2\beta^{-1} + u\right)\Psi'(u) + \frac{N^2}{T}\,c_N\,\Psi(u) = 0. \tag{6.27}$$


This is the differential equation for the so-called associated Laguerre polynomials $L_N^{(\alpha)}$ with
$\alpha = T - N - 2\beta^{-1}$. It has polynomial solutions of degree $N$ if and only if the coefficient
of the $\Psi(u)$ term is an integer equal to $N$ (i.e. if $c_N = T/N$). The solution is then given by

$$\Psi(u) \propto L_N^{(T-N-2\beta^{-1})}(u), \tag{6.28}$$

where

$$L_n^{(\alpha)}(x) = x^{-\alpha}\,\frac{\left(\frac{d}{dx} - 1\right)^n}{n!}\,x^{n+\alpha}. \tag{6.29}$$

Note that associated Laguerre polynomials are orthogonal with respect to the measure
$x^{\alpha}e^{-x}$, i.e.

$$\int_0^{\infty} dx\, x^{\alpha} e^{-x}\, L_n^{(\alpha)}(x)\,L_m^{(\alpha)}(x) = \delta_{nm}\,\frac{(n+\alpha)!}{n!}. \tag{6.30}$$

Given that the standard associated Laguerre polynomials have as a leading term
$(-1)^N x^N/N!$ and that $\psi(x)$ is monic, we finally find

$$\psi(x) = \begin{cases}
(-1)^N N!\,T^{-N}\,L_N^{(T-N-2)}(Tx) & \text{real symmetric } (\beta = 1),\\[4pt]
(-1)^N N!\,T^{-N}\,L_N^{(T-N-1)}(Tx) & \text{complex Hermitian } (\beta = 2).
\end{cases} \tag{6.31}$$


Hence, the most likely positions of the Coulomb-Wishart charges are given by the zeros
of associated Laguerre polynomials, exactly as the most likely positions of the Coulomb-Wigner
charges are given by the zeros of Hermite polynomials.

We should nevertheless check that $c_N = T/N$ is compatible with Eq. (6.26), i.e. that the
following equality holds:

$$\frac{T}{N} = \frac{T - N + 1 - 2\beta^{-1}}{N}\;\frac{1}{N}\sum_{k=1}^{N}\frac{1}{\lambda_k^*}, \tag{6.32}$$

where the $\lambda_k^*$ are the zeros of $L_N^{(\alpha)}(Tx)$, i.e. $\lambda_k^* = \ell_k^{(\alpha)}/T$, where $\ell_k^{(\alpha)}$ are the zeros of the
associated Laguerre polynomials $L_N^{(\alpha)}$, which indeed obey the following relation:³

$$\frac{1}{N}\sum_{k=1}^{N}\frac{1}{\ell_k^{(\alpha)}} = \frac{1}{\alpha + 1}. \tag{6.33}$$

Figure 6.2 Histogram of the 100 zeros of $L_{100}^{(100)}(x)$. The full line is the Marčenko-Pastur distribution
(4.43) with $q = 1/2$ scaled by a factor $T = 200$.

From the results of Section 5.3.1, we thus conclude that the zeros of the Laguerre polynomials
$L_N^{(\alpha)}(Tx)$ converge to a Wishart distribution with $q = N/T$. Figure 6.2 shows
the histogram of zeros of $L_{100}^{(100)}(x)$ with the asymptotic prediction for large $N$ and $T$. Note
that $\alpha \approx N(q^{-1} - 1)$ in that limit.
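The inverse-moment relation for the Laguerre zeros can be checked without computing any zero: for a polynomial $p(x) = c_0 + c_1 x + \dots$ with non-zero roots $x_k$, Vieta's formulas give $\sum_k 1/x_k = -c_1/c_0$. The sketch below assumes the standard explicit coefficients $L_n^{(\alpha)}(x) = \sum_i (-1)^i \binom{n+\alpha}{n-i} x^i/i!$ for a non-negative integer $\alpha$; function names are ours.

```python
from fractions import Fraction
from math import comb, factorial

def laguerre_coeffs(n, alpha):
    """Coefficients (lowest degree first) of L_n^{(alpha)}(x), integer alpha >= 0."""
    return [Fraction((-1) ** i * comb(n + alpha, n - i), factorial(i))
            for i in range(n + 1)]

def mean_inverse_zero(n, alpha):
    """(1/n) * sum_k 1/l_k over the zeros l_k, obtained from Vieta: sum 1/l_k = -c1/c0."""
    c = laguerre_coeffs(n, alpha)
    return -c[1] / c[0] / n

# Compatibility check c_N = T/N for beta = 2, where alpha = T - N - 1.
N, T = 100, 201
alpha = T - N - 1
c_N = Fraction(T - N, N) * T * mean_inverse_zero(N, alpha)   # lambda_k = l_k / T
print(mean_inverse_zero(N, alpha), c_N)
```

Using `Fraction` keeps the comparison exact, so the check does not depend on floating-point root finding for a degree-100 polynomial.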


6.2.2 Average Characteristic Polynomial 


As in the Wigner case we would like to get a recursion relation for $Q_{q,N}(z) := \mathbb{E}[\det(z\mathbf{1} - \mathbf{W}_q^{(N)})]$,
where $\mathbf{W}_q^{(N)}$ is a white Wishart matrix of size $N$ and parameter $q = N/T$. This
time the recursion will be over $T$ at $N$ fixed. So we keep $N$ fixed (we will drop the $(N)$
index to keep the notation light) and consider an unnormalized white Wishart matrix:

$$\mathbf{Y}_T = \sum_{t=1}^{T} \mathbf{v}_t \mathbf{v}_t^T, \tag{6.34}$$

where $\mathbf{v}_t$ are $T$ $N$-dimensional independent random vectors uniformly distributed on the
sphere, normalized such that $\mathbb{E}[\mathbf{v}_t\mathbf{v}_t^T] = \mathbf{1}$. We want to compute

$$q_{T,N}(z) = \mathbb{E}[\det(z\mathbf{1} - \mathbf{Y}_T)]. \tag{6.35}$$

3 See e.g. Alıcı and Taşeli [2015], where other inverse moments of the $\ell_k^{(\alpha)}$ are also derived.


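The properly normalized expected characteristic polynomial is then given by $Q_{q,N}(z) = T^{-N} q_{T,N}(Tz)$.
To construct our recursion relation, we will make use of the Sherman-Morrison
formula (Eq. (1.30)), which states that for an invertible matrix $\mathbf{A}$ and vectors $\mathbf{u}$
and $\mathbf{v}$,

$$\det(\mathbf{A} + \mathbf{u}\mathbf{v}^T) = (1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u})\det\mathbf{A}. \tag{6.36}$$

Applying this formula with $\mathbf{A} = z\mathbf{1} - \mathbf{Y}_{T-1}$, we get

$$\det(z\mathbf{1} - \mathbf{Y}_T) = \left(1 - \mathbf{v}_T^T(z\mathbf{1} - \mathbf{Y}_{T-1})^{-1}\mathbf{v}_T\right)\det(z\mathbf{1} - \mathbf{Y}_{T-1}). \tag{6.37}$$

The vector $\mathbf{v}_T$ is independent of the rest of $\mathbf{Y}_{T-1}$, so taking the expectation value with
respect to this last vector we get

$$\mathbb{E}_{\mathbf{v}_T}\!\left[\mathbf{v}_T^T(z\mathbf{1} - \mathbf{Y}_{T-1})^{-1}\mathbf{v}_T\right] = \operatorname{Tr}\left[(z\mathbf{1} - \mathbf{Y}_{T-1})^{-1}\right]. \tag{6.38}$$

Now, using once again the general relation (easily derived in the basis where $\mathbf{A}$ is diagonal),

$$\operatorname{Tr}\left[(z\mathbf{1} - \mathbf{A})^{-1}\right]\det(z\mathbf{1} - \mathbf{A}) = \frac{d}{dz}\det(z\mathbf{1} - \mathbf{A}), \tag{6.39}$$

with $\mathbf{A} = \mathbf{Y}_{T-1}$, we can take the expectation value of Eq. (6.37). We obtain

$$q_{T,N}(z) = \left(1 - \frac{d}{dz}\right) q_{T-1,N}(z). \tag{6.40}$$

To start the recursion relation, we note that $\mathbf{Y}_0$ is the $N$-dimensional zero matrix for which
$q_{0,N}(z) = z^N$. Hence,⁴

$$q_{T,N}(z) = \left(1 - \frac{d}{dz}\right)^{T} z^N. \tag{6.41}$$

If we apply an extra $\left(1 - \frac{d}{dz}\right)$ to Eq. (6.41), we get the following recursion relation:

$$q_{T+1,N}(z) = q_{T,N}(z) - N\,q_{T,N-1}(z), \tag{6.42}$$

which is similar to the classic “three-point rule” for Laguerre polynomials:

$$L_n^{(\alpha+1)}(x) = L_n^{(\alpha)}(x) + L_{n-1}^{(\alpha+1)}(x). \tag{6.43}$$

This allows us to make the identification

$$q_{T,N}(z) = (-1)^N N!\,L_N^{(T-N)}(z). \tag{6.44}$$

The correctly normalized average characteristic polynomial finally reads:

$$Q_{q,N}(z) = (-1)^N\,T^{-N} N!\,L_N^{(T-N)}(Tz). \tag{6.45}$$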


4 This relation will be further discussed in the context of finite free convolutions, see Chapter 12.

Hence, the average characteristic polynomial of real Wishart matrices is a Laguerre polynomial,
albeit with a slightly different value of $\alpha$ compared to the one obtained in Eq. (6.31)
above ($\alpha = T - N$ instead of $\alpha = T - N - 2$). The difference, however, becomes small
when $N, T \to \infty$.
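The operator form $(1 - d/dz)^T z^N$ is straightforward to evaluate on coefficient lists, which gives an exact check of the Laguerre identification and of the recursion in $T$. This is a sketch under the assumption of the standard explicit Laguerre coefficients, $L_N^{(\alpha)}(z) = \sum_i (-1)^i \binom{N+\alpha}{N-i} z^i/i!$; the function names are ours.

```python
from math import comb, factorial

def q_operator(T, N):
    """Coefficients (lowest degree first) of (1 - d/dz)^T z^N."""
    p = [0] * N + [1]                                    # z^N
    for _ in range(T):
        dp = [i * c for i, c in enumerate(p)][1:] + [0]  # derivative, padded to len(p)
        p = [a - b for a, b in zip(p, dp)]
    return p

def q_laguerre(T, N):
    """Coefficients of (-1)^N N! L_N^{(T-N)}(z) from the explicit Laguerre sum."""
    return [(-1) ** (N + i) * comb(T, N - i) * (factorial(N) // factorial(i))
            for i in range(N + 1)]

print(q_operator(2, 2))    # z^2 - 4 z + 2, lowest degree first
print(q_laguerre(2, 2))
```

For $T = N = 2$ both routes give $z^2 - 4z + 2 = 2L_2^{(0)}(z)$, which can also be verified by hand.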


6.3 Unitary Ensembles 


In this section we will discuss the average characteristic polynomial for unitary ensembles,
i.e. ensembles of complex Hermitian matrices that are invariant under unitary transformations.
Although this book mainly deals with real symmetric matrices, more is known
about complex Hermitian matrices, so we want to give a few general results about these
matrices that do not have a known equivalent in the real case. The main reason unitary
ensembles are easier to deal with than orthogonal ones has to do with the Vandermonde
determinant which is needed to change variables from matrix elements to eigenvalues.
Recall that

$$|\det(\Delta(\mathbf{M}))| = |\det \mathbf{V}|^{\beta} \quad \text{with} \quad \mathbf{V} = \begin{pmatrix}
1 & 1 & 1 & \dots & 1\\
\lambda_1 & \lambda_2 & \lambda_3 & \dots & \lambda_N\\
\lambda_1^2 & \lambda_2^2 & \lambda_3^2 & \dots & \lambda_N^2\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\lambda_1^{N-1} & \lambda_2^{N-1} & \lambda_3^{N-1} & \dots & \lambda_N^{N-1}
\end{pmatrix}. \tag{6.46}$$

In the orthogonal case $\beta = 1$, the absolute value sign is needed to get the correct result.
In the case $\beta = 2$, $(\det \mathbf{V})^2$ is automatically positive and no absolute value is needed.
The absolute value for $\beta = 1$ is very hard to deal with analytically, while for $\beta = 2$ the
Vandermonde determinant is a polynomial in the eigenvalues.


6.3.1 Complex Wigner 


In Section 6.1.2 we have shown that the expected characteristic polynomial of a unit variance
real Wigner matrix is given by the $N$th Hermite polynomial properly rescaled. The
argument relied on two facts: (i) the expectation value of any element of a Wigner matrix
is zero and (ii) all matrix elements are independent, save for

$$\mathbb{E}[W_{ij}W_{ji}] = 1/N \quad \text{for } i \neq j. \tag{6.47}$$

These two properties are shared by complex Wigner matrices. Therefore, the expected
characteristic polynomial of a complex Wigner matrix is the same as for a real Wigner
of the same size. We have shown that for a real or complex Wigner of size $N$,

$$Q_N(z) := \mathbb{E}[\det(z\mathbf{1} - \mathbf{W})] = N^{-N/2} H_N(\sqrt{N}z), \tag{6.48}$$

where $H_N(x)$ is the $N$th Hermite polynomial, i.e. the $N$th monic polynomial orthogonal in
the following sense:

$$\int_{-\infty}^{\infty} dx\, H_i(x)H_j(x)\,e^{-x^2/2} = 0 \quad \text{when } i \neq j. \tag{6.49}$$

We can actually absorb the factors of $N$ in $Q_N(z)$ in the measure $\exp(-x^2/2)$ and realize
that the polynomial $Q_N(z)$ is the $N$th monic polynomial orthogonal with respect to the
measure

$$w_N(x) = \exp(-Nx^2/2). \tag{6.50}$$

There are two important remarks to be made about the orthogonality of $Q_N(x)$ with respect
to the measure $w_N(x)$. First, $Q_N(x)$ is the $N$th in a series of orthogonal polynomials with
respect to an $N$-dependent measure. In particular $Q_M(x)$ for $M \neq N$ is an orthogonal
polynomial coming from a different measure. Second, the measure $w_N(x)$ is exactly the
weight coming from the matrix potential $\exp(-\beta N V(\lambda)/2)$ for $\beta = 2$ and $V(x) = x^2/2$.
In Section 6.3.3, we will see that these two statements are true for a general potential $V(x)$
when $\beta = 2$.
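The orthogonality of the rescaled polynomial with respect to the $N$-dependent Gaussian weight can be verified exactly, because the moments of a Gaussian of variance $1/N$ are known in closed form: $\langle x^k\rangle = (k-1)!!/N^{k/2}$ for even $k$ and zero otherwise. The sketch below (with our own helper names) checks that $Q_N$ is orthogonal to all monomials of lower degree.

```python
from fractions import Fraction

def hermite(n):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial H_n."""
    prev, cur = [1], [0, 1]
    if n == 0:
        return prev
    for k in range(2, n + 1):
        nxt = [0] + cur
        for i, c in enumerate(prev):
            nxt[i] -= (k - 1) * c
        prev, cur = cur, nxt
    return cur

def Q_coeffs(N):
    """Monic Q_N(x) = N^{-N/2} H_N(sqrt(N) x) as exact fractions.
    The coefficient of x^j is h_j * N^{(j-N)/2}; h_j vanishes unless N - j is even."""
    return [Fraction(h, N ** ((N - j) // 2)) if h else Fraction(0)
            for j, h in enumerate(hermite(N))]

def moment(k, N):
    """k-th moment of the normalized weight proportional to exp(-N x^2 / 2)."""
    if k % 2:
        return Fraction(0)
    double_factorial = 1
    for i in range(1, k, 2):
        double_factorial *= i
    return Fraction(double_factorial, N ** (k // 2))

def inner_with_monomial(N, k):
    """Integral of Q_N(x) x^k against the normalized weight."""
    return sum(c * moment(j + k, N) for j, c in enumerate(Q_coeffs(N)))

print([inner_with_monomial(5, k) for k in range(6)])   # zeros for k < 5, non-zero for k = 5
```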


6.3.2 Complex Wishart 


A complex white Wishart matrix can be written as a normalized sum of rank-1 complex
Hermitian projectors:

$$\mathbf{W} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{v}_t\mathbf{v}_t^{\dagger}, \tag{6.51}$$

where the vectors $\mathbf{v}_t$ are vectors of IID complex Gaussian numbers with zero mean and
normalized as

$$\mathbb{E}[\mathbf{v}_t\mathbf{v}_t^{\dagger}] = \mathbf{1}. \tag{6.52}$$

The derivation of the average characteristic polynomial in the Wishart case in Section 6.2.2
only used the independence of the vectors $\mathbf{v}_t$ and the expectation value $\mathbb{E}[\mathbf{v}_t\mathbf{v}_t^T] = \mathbf{1}$. So,
by replacing the matrix transposition by the Hermitian conjugation in the derivation we
can show that the expected characteristic polynomial of a complex white Wishart of size
$N$ is also given by a Laguerre polynomial, as in Eq. (6.45). The Laguerre polynomials
$L_N^{(T-N)}(x)$ are orthogonal in the sense of Eq. (6.30), with $\alpha = T - N$. As in the Wigner
case, we can include the extra factor of $T$ in the orthogonality weight and realize that the
expected characteristic polynomial of a real or complex Wishart matrix is the $N$th monic
polynomial orthogonal with respect to the weight:

$$w_N(x) = x^{T-N}e^{-Tx} \quad \text{for } 0 \leq x < \infty. \tag{6.53}$$

This weight function is precisely the single eigenvalue weight, without the Vandermonde
term, of a complex Hermitian white Wishart of size $N$ (see the footnote on page 46).
Note that the normalization of the weight $w_N(x)$ is irrelevant: the condition that the polynomial
is monic uniquely determines its normalization. Note as well that the real case is
given by the same polynomials, i.e. polynomials that are orthogonal with respect to the
complex weight, which is different from the real weight.


6.3.3 General Potential V (x) 


The average characteristic polynomial for a matrix of size $N$ in a unitary ensemble with
potential $V(\mathbf{M})$ is given by

$$Q_N(z) := \mathbb{E}[\det(z\mathbf{1} - \mathbf{M})], \tag{6.54}$$

which we can express via the joint law of the eigenvalues of $\mathbf{M}$:

$$Q_N(z) \propto \int d^N\mathbf{x}\, \prod_{k=1}^{N}(z - x_k)\;\Delta^2(\mathbf{x})\;e^{-N\sum_{k=1}^{N}V(x_k)}, \tag{6.55}$$

where $\Delta(\mathbf{x})$ is the Vandermonde determinant:

$$\Delta(\mathbf{x}) = \prod_{k<\ell}(x_\ell - x_k). \tag{6.56}$$

We do not need to worry about the normalization of the above expectation value as we know
that $Q_N(z)$ is a monic polynomial of degree $N$. In other words, the condition $Q_N(z) = z^N + O(z^{N-1})$
is sufficient to properly normalize $Q_N(z)$. The first step of the computation
is to combine one of the two Vandermonde determinants with the product of $(z - x_k)$:

$$\Delta^2(\mathbf{x})\prod_{k=1}^{N}(z - x_k) = \Delta(\mathbf{x})\prod_{k<\ell}(x_\ell - x_k)\prod_{k=1}^{N}(z - x_k) = \Delta(\mathbf{x})\,\Delta(\mathbf{x}; z), \tag{6.57}$$

where $\Delta(\mathbf{x}; z)$ is a Vandermonde determinant of $N + 1$ variables, namely the $N$ variables
$x_k$ and the extra variable $z$.
The second step is to write the determinants in the Vandermonde form:

$$\Delta(\mathbf{x}) = \det\begin{pmatrix}
1 & 1 & 1 & \dots & 1\\
x_1 & x_2 & x_3 & \dots & x_N\\
x_1^2 & x_2^2 & x_3^2 & \dots & x_N^2\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
x_1^{N-1} & x_2^{N-1} & x_3^{N-1} & \dots & x_N^{N-1}
\end{pmatrix}. \tag{6.58}$$

We can add or subtract to any row a multiple of any other row and not change the above
determinant. By doing so we can transform all monomials $x^k$ into a monic polynomial of
degree $k$ of our choice, so we have

$$\Delta(\mathbf{x}) = \det\begin{pmatrix}
1 & 1 & 1 & \dots & 1\\
p_1(x_1) & p_1(x_2) & p_1(x_3) & \dots & p_1(x_N)\\
p_2(x_1) & p_2(x_2) & p_2(x_3) & \dots & p_2(x_N)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
p_{N-1}(x_1) & p_{N-1}(x_2) & p_{N-1}(x_3) & \dots & p_{N-1}(x_N)
\end{pmatrix}. \tag{6.59}$$

We will choose the polynomials $p_n(x)$ to be the monic polynomials orthogonal with respect
to the measure $w(x) = e^{-NV(x)}$; this will turn out to be extremely useful. We can now
perform the integral of the vector $\mathbf{x}$ in the following expression:

$$Q_N(z) \propto \int d^N\mathbf{x}\; \Delta(\mathbf{x})\,\Delta(\mathbf{x}; z)\;e^{-N\sum_{k=1}^{N}V(x_k)}. \tag{6.60}$$


If we expand the two determinants $\Delta(\mathbf{x})$ and $\Delta(\mathbf{x}; z)$ as signed sums over all permutations
and take their product, we realize that in each term each variable $x_k$ will appear exactly
twice in two polynomials, say $p_n(x_k)$ and $p_m(x_k)$, but by orthogonality we have

$$\int dx\, p_n(x)\,p_m(x)\,e^{-NV(x)} = Z_n\,\delta_{mn}, \tag{6.61}$$

where $Z_n$ is a normalization constant that will not matter in the end. The only terms that will
survive are those for which every $x_k$ appears in the same polynomial in both determinants.
For this to happen, the variable $z$ must appear as $p_N(z)$, the only polynomial not in the first
determinant. So this trick allows us to conclude with very little effort that

$$Q_N(z) \propto p_N(z). \tag{6.62}$$

But since both $Q_N(z)$ and $p_N(z)$ are monic, they must be equal. We have just shown that
for a Hermitian matrix $\mathbf{M}$ of size $N$ drawn from a unitary ensemble with potential $V(x)$,

$$\mathbb{E}[\det(z\mathbf{1} - \mathbf{M})] = p_N(z), \tag{6.63}$$

where $p_N(x)$ is the $N$th monic orthogonal polynomial with respect to the measure $e^{-NV(x)}$.


It is possible to generalize this result to the expectation of products of characteristic polynomials
evaluated at $K$ different points $z_k$, which allows one to study the joint distribution
of $K$ eigenvalues. We give here the result without proof.⁵ We first define the expectation
value of a product of $K$ characteristic polynomials:

$$F_K(z_1, z_2, \dots, z_K) := \mathbb{E}[\det(z_1\mathbf{1} - \mathbf{M})\det(z_2\mathbf{1} - \mathbf{M})\cdots\det(z_K\mathbf{1} - \mathbf{M})]. \tag{6.64}$$

The multivariate function $F_K$ can be expressed as a determinant of orthogonal
polynomials:

$$F_K(z_1, z_2, \dots, z_K) = \frac{1}{\Delta}\det\begin{pmatrix}
p_N(z_1) & p_N(z_2) & \dots & p_N(z_K)\\
p_{N+1}(z_1) & p_{N+1}(z_2) & \dots & p_{N+1}(z_K)\\
\vdots & \vdots & \ddots & \vdots\\
p_{N+K-1}(z_1) & p_{N+K-1}(z_2) & \dots & p_{N+K-1}(z_K)
\end{pmatrix}, \tag{6.65}$$

5 See Brézin and Hikami [2011] for a derivation. Note that their formula equivalent to Eq. (6.67) is missing the $K$-dependent
constant factor.

where $\Delta := \Delta(z_1, z_2, \dots, z_K)$ is the usual Vandermonde determinant and the $p_\ell(x)$ are
the monic polynomials orthogonal with respect to $e^{-NV(x)}$. When $K = 1$,
$\Delta = 1$ by definition and we recover our previous result $F_1(z) = p_N(z)$. When the
arguments of $F_K$ are not all different, Eq. (6.65) gives an undetermined result (0/0) but
the limit is well defined. A useful case is when all arguments are equal:

$$F_K(z) := F_K(z, z, \dots, z) = \mathbb{E}\left[\det(z\mathbf{1} - \mathbf{M})^K\right]. \tag{6.66}$$

Taking the limit of Eq. (6.65) we find the rather simple result

$$F_K(z) = \frac{1}{\prod_{\ell=0}^{K-1}\ell!}\det\begin{pmatrix}
p_N(z) & p_N'(z) & \dots & p_N^{(K-1)}(z)\\
p_{N+1}(z) & p_{N+1}'(z) & \dots & p_{N+1}^{(K-1)}(z)\\
\vdots & \vdots & \ddots & \vdots\\
p_{N+K-1}(z) & p_{N+K-1}'(z) & \dots & p_{N+K-1}^{(K-1)}(z)
\end{pmatrix}, \tag{6.67}$$

where $p_\ell^{(k)}(x)$ is the $k$th derivative of the $\ell$th polynomial. In particular the average-square
characteristic polynomial is given by

$$F_2(z) = p_N(z)\,p_{N+1}'(z) - p_N'(z)\,p_{N+1}(z). \tag{6.68}$$
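The $K = 2$ formula can be tried out on the simplest case, where the monic orthogonal polynomials are the Hermite polynomials of the unit Gaussian weight (an assumption appropriate for unscaled matrices; for $N = 1$ the matrix is a single unit Gaussian number $x$ and $\mathbb{E}[(z - x)^2] = z^2 + 1$ can be computed directly). The helper names in this sketch are ours.

```python
def hermite(n):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial H_n."""
    prev, cur = [1], [0, 1]
    if n == 0:
        return prev
    for k in range(2, n + 1):
        nxt = [0] + cur
        for i, c in enumerate(prev):
            nxt[i] -= (k - 1) * c
        prev, cur = cur, nxt
    return cur

def derivative(p):
    return [i * c for i, c in enumerate(p)][1:] or [0]

def multiply(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def subtract(p, q):
    n = max(len(p), len(q))
    out = [a - b for a, b in zip(p + [0] * (n - len(p)), q + [0] * (n - len(q)))]
    while len(out) > 1 and out[-1] == 0:   # drop vanishing leading coefficients
        out.pop()
    return out

def second_moment(N):
    """F_2 = p_N p_{N+1}' - p_N' p_{N+1} with p_n = H_n."""
    pN, pN1 = hermite(N), hermite(N + 1)
    return subtract(multiply(pN, derivative(pN1)), multiply(derivative(pN), pN1))

def variance(N):
    """F_2 minus the square of the mean characteristic polynomial."""
    return subtract(second_moment(N), multiply(hermite(N), hermite(N)))

print(second_moment(1))   # z^2 + 1, lowest degree first
print(variance(2))        # 2 z^2 + 2, lowest degree first
```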


Exercise 6.3.1 Variance of the Characteristic Polynomial of a 2 × 2 Hermitian
Wigner Matrix


(a) Show that the characteristic polynomial of a 2 × 2 Hermitian Wigner matrix
is given by

$$Q_{\mathbf{W}}(z) = (z - w_{11})(z - w_{22}) - (w_{12}^R)^2 - (w_{12}^I)^2, \tag{6.69}$$

where $w_{11}$, $w_{22}$, $w_{12}^R$ and $w_{12}^I$ are four real independent Gaussian random
numbers with variance 1 for the first two and 1/2 for the other two.

(b) Compute directly the mean and the variance of $Q_{\mathbf{W}}(z)$.

(c) Use Eqs. (6.63) and (6.68) and the first few Hermite polynomials given in
Section 6.1.1 to obtain the same result, namely $\mathbb{V}[Q_{\mathbf{W}}(z)] = 2z^2 + 2$.

The Jacobi Ensemble*


So far we have encountered two classical random matrix ensembles, namely Wigner and 
Wishart. They are, respectively, the matrix equivalents of the Gaussian and the gamma 
distribution. For example a 1 x 1 Wigner matrix is a single Gaussian random number and 
a 1 x 1 Wishart is a gamma distributed number (see Eq. (4.16) with N = 1). We also 
saw in Chapter 6 that these ensembles are intimately related to classical orthogonal poly- 
nomials, respectively Hermite and Laguerre. The Gaussian distribution and its associated 
Hermite polynomials appear very naturally in contexts where the underlying variable is 
unbounded above and below. Gamma distributions and Laguerre polynomials appear in 
problems where the variable is bounded from below (e.g. positive variables). Variables 
that are bounded both from above and from below have their own natural distribution 
and associated classical orthogonal polynomials, namely the beta distribution and Jacobi 
polynomials. 

In this chapter, we introduce a third classical random matrix ensemble: the Jacobi 
ensemble. It is the random matrix equivalent of the beta distribution (and hence often 
called matrix variate beta distribution). It will turn out to be strongly linked to Jacobi 
orthogonal polynomials. 

Jacobi matrices appear in multivariate analysis of variance and hence the Jacobi ensem- 
ble is sometimes called the MANOVA ensemble. An important special case of the Jacobi 
ensemble is the arcsine law which we already encountered in Section 5.5, and will again 
encounter in Section 15.3.1. It is the law governing Coulomb repelling eigenvalues with 
no external forces save for two hard walls. It also shows up in simple problems of matrix 
addition and multiplications for matrices with only two eigenvalues. 


7.1 Properties of Jacobi Matrices 
7.1.1 Construction of a Jacobi Matrix 


A beta-distributed random variable $x \in (0,1)$ has the following law:

$$P_{c_1,c_2}(x) = \frac{\Gamma(c_1 + c_2)}{\Gamma(c_1)\Gamma(c_2)}\,x^{c_1-1}(1-x)^{c_2-1}, \tag{7.1}$$

where $c_1 > 0$ and $c_2 > 0$ are two parameters characterizing the law.

To generalize (7.1) to matrices, we could define $\mathbf{J}$ as a matrix generated from a beta
ensemble with a matrix potential that tends to $V(x) = -\log(P_{c_1,c_2}(x))$ in the large $N$
limit. Although this is indeed the result we will get in the end, we would rather use a more
constructive approach that will give us a sensible definition of the matrix $\mathbf{J}$ at finite $N$ and
for the three standard values of $\beta$.

A beta($c_1$, $c_2$) random number can alternatively be generated from two gamma-distributed
variables:

$$x = \frac{w_1}{w_1 + w_2}, \qquad w_{1,2} \sim \text{Gamma}(c_{1,2}, 1). \tag{7.2}$$

The same relation can be rewritten as

$$x = \frac{1}{1 + w_1^{-1}w_2}. \tag{7.3}$$

This is the formula we need for our matrix generalization. An unnormalized white Wishart
with $T = cN$ is the matrix generalization of a Gamma($c$, 1) random variable. Combining
two such matrices as above will give us our Jacobi random matrix $\mathbf{J}$. One last point before
we proceed: we need to symmetrize the combination $w_1^{-1}w_2$ to yield a symmetric matrix.
We choose $\sqrt{w_2}\,w_1^{-1}\sqrt{w_2}$, which makes sense as Wishart matrices (like gamma-distributed
numbers) are positive definite.
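The scalar ($N = 1$) version of this construction can be simulated with the standard library alone. The sketch below draws two gamma variables, forms $x = 1/(1 + w_2/w_1)$, and compares the sample mean with the beta mean $c_1/(c_1 + c_2)$; the parameter values, sample size and seed are arbitrary choices of ours.

```python
import random

def beta_from_gammas(c1, c2, rng):
    """One beta(c1, c2) sample built as 1 / (1 + w2 / w1) with w_i ~ Gamma(c_i, 1)."""
    w1 = rng.gammavariate(c1, 1.0)   # shape c1, scale 1
    w2 = rng.gammavariate(c2, 1.0)   # shape c2, scale 1
    return 1.0 / (1.0 + w2 / w1)

rng = random.Random(0)               # fixed seed so that the run is reproducible
c1, c2 = 5.0, 2.0
samples = [beta_from_gammas(c1, c2, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean, c1 / (c1 + c2))
```

With 20000 samples the statistical error on the mean is of order $10^{-3}$, so agreement to two decimal places is what one should expect.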

We can now define the Jacobi matrix. Let $\mathbf{E}$ be the symmetrized product of a white
Wishart and the inverse of another independent white Wishart, both without the usual $1/T$
normalization:

$$\mathbf{E} = \mathbf{W}_2^{1/2}\,\mathbf{W}_1^{-1}\,\mathbf{W}_2^{1/2}, \quad \text{where } \mathbf{W}_{1,2} := \mathbf{H}_{1,2}\mathbf{H}_{1,2}^T. \tag{7.4}$$

The two matrices $\mathbf{H}_{1,2}$ are rectangular matrices of standard Gaussian random numbers with
aspect ratio $c_1 = T_1/N$ and $c_2 = T_2/N$ (note that the usual aspect ratio is $q = c^{-1}$). The
standard Jacobi matrix is defined as

$$\mathbf{J} = (\mathbf{1} + \mathbf{E})^{-1}. \tag{7.5}$$

A Jacobi matrix has all its eigenvalues between 0 and 1. For the matrices $\mathbf{W}_{1,2}$ to make
sense we need $c_{1,2} > 0$. In addition, to ensure that $\mathbf{W}_1$ is invertible we need to impose
$c_1 > 1$. It turns out that we can relax that assumption later and the ensemble still makes
sense for any $c_{1,2} > 0$.


7.1.2 Joint Law of the Elements 


The joint law of the elements of a Jacobi matrix for $\beta = 1, 2$ or 4 is given by

$$P_\beta(\mathbf{J}) = c_J^{\beta,T_1,T_2}\,[\det \mathbf{J}]^{\beta(T_1-N+1)/2-1}\,[\det(\mathbf{1} - \mathbf{J})]^{\beta(T_2-N+1)/2-1}, \tag{7.6}$$

with

$$c_J^{\beta,T_1,T_2} = \prod_{j=1}^{N}\frac{\Gamma(1+\beta/2)\,\Gamma(\beta(T_1+T_2-N+j)/2)}{\Gamma(1+\beta j/2)\,\Gamma(\beta(T_1-N+j)/2)\,\Gamma(\beta(T_2-N+j)/2)}, \tag{7.7}$$

over the space of matrices of the proper symmetry such that both $\mathbf{J}$ and $\mathbf{1} - \mathbf{J}$ are positive
definite.
To obtain this result, one needs to know the law of Wishart matrices (Chapter 4) and the
law of a matrix given the law of its inverse (Chapter 1).
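Here is the derivation in the real symmetric case. We first write the law of the matrix $\mathbf{E}$
by realizing that for a fixed matrix $\mathbf{W}_1$, the matrix $\mathbf{E}/T_2$ is a Wishart matrix with $T = T_2$
and true covariance $\mathbf{W}_1^{-1}$. From Eq. (4.16), we thus have

$$P(\mathbf{E}|\mathbf{W}_1) = \frac{(\det\mathbf{E})^{(T_2-N-1)/2}(\det\mathbf{W}_1)^{T_2/2}}{2^{NT_2/2}\,\Gamma_N(T_2/2)}\exp\left[-\frac{1}{2}\operatorname{Tr}(\mathbf{E}\mathbf{W}_1)\right]. \tag{7.8}$$

The matrix $\mathbf{W}_1/T_1$ is itself a Wishart with probability

$$P(\mathbf{W}_1) = \frac{(\det\mathbf{W}_1)^{(T_1-N-1)/2}}{2^{NT_1/2}\,\Gamma_N(T_1/2)}\exp\left[-\frac{1}{2}\operatorname{Tr}\mathbf{W}_1\right]. \tag{7.9}$$

Averaging Eq. (7.8) with respect to $\mathbf{W}_1$ we find

$$P(\mathbf{E}) = \frac{(\det\mathbf{E})^{(T_2-N-1)/2}}{2^{N(T_1+T_2)/2}\,\Gamma_N(T_1/2)\,\Gamma_N(T_2/2)} \int d\mathbf{W}\,(\det\mathbf{W})^{(T_1+T_2-N-1)/2}\exp\left[-\frac{1}{2}\operatorname{Tr}\left((\mathbf{1}+\mathbf{E})\mathbf{W}\right)\right]. \tag{7.10}$$

We can perform the integral over $\mathbf{W}$ by realizing that $\mathbf{W}/T$ is a Wishart matrix with
$T = T_1 + T_2$ and true covariance $\mathbf{C} = (\mathbf{1} + \mathbf{E})^{-1}$, see Eq. (4.16). We just need to
introduce the correct power of $\det\mathbf{C}$ and numerical factors to make the integral equal to 1.
Thus,

$$P(\mathbf{E}) = \frac{\Gamma_N((T_1+T_2)/2)}{\Gamma_N(T_1/2)\,\Gamma_N(T_2/2)}\,(\det\mathbf{E})^{(T_2-N-1)/2}\,(\det(\mathbf{1}+\mathbf{E}))^{-(T_1+T_2)/2}. \tag{7.11}$$

The miracle that for any $N$ one can integrate exactly the product of a Wishart matrix and
the inverse of another Wishart matrix will appear again in the Bayesian theory of SCM (see
Section 18.3).
Writing $\mathbf{E}_+ = \mathbf{E} + \mathbf{1}$ we find

$$P(\mathbf{E}_+) = \frac{\Gamma_N((T_1+T_2)/2)}{\Gamma_N(T_1/2)\,\Gamma_N(T_2/2)}\,(\det(\mathbf{E}_+ - \mathbf{1}))^{(T_2-N-1)/2}\,(\det\mathbf{E}_+)^{-(T_1+T_2)/2}. \tag{7.12}$$

Finally we want $\mathbf{J} := \mathbf{E}_+^{-1}$. The law of the inverse of a symmetric matrix $\mathbf{A} = \mathbf{M}^{-1}$ of
size $N$ is given by (see Section 1.2.7)

$$P_{\mathbf{A}}(\mathbf{A}) = P_{\mathbf{M}}(\mathbf{A}^{-1})\,\det(\mathbf{A})^{-N-1}. \tag{7.13}$$

Hence,

$$P(\mathbf{J}) = \frac{\Gamma_N((T_1+T_2)/2)}{\Gamma_N(T_1/2)\,\Gamma_N(T_2/2)}\,\left(\det(\mathbf{J}^{-1} - \mathbf{1})\right)^{(T_2-N-1)/2}\,(\det\mathbf{J})^{(T_1+T_2)/2-N-1}. \tag{7.14}$$

Using $\det(\mathbf{J}^{-1} - \mathbf{1})\det(\mathbf{J}) = \det(\mathbf{1} - \mathbf{J})$ we can reorganize the powers of the determinants
and get

$$P(\mathbf{J}) = \frac{\Gamma_N((T_1+T_2)/2)}{\Gamma_N(T_1/2)\,\Gamma_N(T_2/2)}\,(\det(\mathbf{1} - \mathbf{J}))^{(T_2-N-1)/2}\,(\det\mathbf{J})^{(T_1-N-1)/2}, \tag{7.15}$$

which is equivalent to Eq. (7.6) for $\beta = 1$.

7.1.3 Potential and Stieltjes Transform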


The Jacobi ensemble is a beta ensemble satisfying Eq. (5.2) with matrix potential

$$V(x) = -\frac{T_1 - N + 1 - 2/\beta}{N}\log(x) - \frac{T_2 - N + 1 - 2/\beta}{N}\log(1 - x). \tag{7.16}$$

The derivative of the potential, in the large $N$ limit, is given by

$$V'(x) = \frac{c_1 - 1 - (c_1 + c_2 - 2)x}{x(x-1)}. \tag{7.17}$$

The function $V'(x)$ is not a polynomial or a Laurent polynomial but the function
$x(x-1)V'(x)$ is a degree one polynomial. With a slight modification of the argument of
Section 5.2.2, we can show that $x(x-1)\Pi(x) = r + sx$ is a degree one polynomial. In the
large $N$ limit we then have

$$g(z) = \frac{c_1 - 1 - (c_+ - 2)z + \sqrt{c_+^2 z^2 - 2(c_1c_+ + c_-)z + (c_1 - 1)^2}}{2z(z-1)}, \tag{7.18}$$

where we have used the fact that we need $s = 0$ and $r = 1 - c_1 - c_2$ to get a $1/z$ behavior
at infinity; we used the shorthand $c_\pm = c_2 \pm c_1$.
From the large $z$ limit of (7.18) one can read off the normalized trace (or the average
eigenvalue):

$$\tau(\mathbf{J}) = \frac{c_1}{c_1 + c_2}, \tag{7.19}$$

which is equal to one-half when $c_1 = c_2$. For $c_1 > 1$ and $c_2 > 1$, there are no poles, and
eigenvalues exist only when the argument of the square-root is negative. The density of
eigenvalues is therefore given by (see Fig. 7.1)

$$\rho(\lambda) = \frac{c_+\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\lambda(1-\lambda)}, \tag{7.20}$$

where the edges of the spectrum are given by

$$\lambda_\pm = \frac{c_1c_+ + c_- \pm 2\sqrt{c_1c_2(c_+ - 1)}}{c_+^2}. \tag{7.21}$$

Figure 7.1 Density of eigenvalues for the Jacobi ensemble with $c_1 = 5$ and $c_2 = 2$. The histogram
is a simulation of a single $N = 1000$ matrix with the same parameters.

For $0 < c_1 < 1$ or $0 < c_2 < 1$, Eq. (7.18) will have poles at $z = 0$ or $z = 1$, corresponding to
Dirac masses in the density, depending on cases (see Exercise 7.1.1).
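For $c_1, c_2 > 1$ the continuous density should integrate to one, and its mean should equal $c_1/(c_1 + c_2)$. The following sketch checks both numerically for $c_1 = 5$, $c_2 = 2$. The substitution $\lambda = \bar\lambda + h\sin\theta$ (our choice, made to tame the square-root behavior at the edges) and the function names are assumptions of this example.

```python
from math import sqrt, pi, sin, cos

def jacobi_edges(c1, c2):
    """Edges (lambda_-, lambda_+) of the large-N Jacobi spectrum."""
    cp, cm = c2 + c1, c2 - c1
    root = 2.0 * sqrt(c1 * c2 * (cp - 1.0))
    return (c1 * cp + cm - root) / cp ** 2, (c1 * cp + cm + root) / cp ** 2

def jacobi_density(lam, c1, c2):
    """Continuous part of the eigenvalue density; zero outside the edges."""
    lo, hi = jacobi_edges(c1, c2)
    if not lo < lam < hi:
        return 0.0
    return (c1 + c2) * sqrt((hi - lam) * (lam - lo)) / (2.0 * pi * lam * (1.0 - lam))

def integrate(f, c1, c2, n=20000):
    """Midpoint rule for the integral of f(lam) * density(lam) between the edges,
    using lam = mid + half * sin(theta) so that the integrand is smooth in theta."""
    lo, hi = jacobi_edges(c1, c2)
    mid, half = 0.5 * (hi + lo), 0.5 * (hi - lo)
    h = pi / n
    total = 0.0
    for k in range(n):
        theta = -0.5 * pi + (k + 0.5) * h
        lam = mid + half * sin(theta)
        total += f(lam) * jacobi_density(lam, c1, c2) * half * cos(theta) * h
    return total

norm = integrate(lambda lam: 1.0, 5.0, 2.0)
mean = integrate(lambda lam: lam, 5.0, 2.0)
print(jacobi_edges(5.0, 2.0), norm, mean)
```

For parameters below 1 the same integral would come out smaller than one, the missing weight being carried by the Dirac masses at 0 or 1.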
In the symmetric case $c_1 = c_2 = c$, we have explicitly

$$g(z) = \frac{(c-1)(1-2z) + \sqrt{c^2(2z-1)^2 - (2c-1)}}{2z(z-1)}. \tag{7.22}$$

The density for $c > 1$ has no Dirac mass and is given by

$$\rho(\lambda) = \frac{c\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\pi\lambda(1-\lambda)}, \tag{7.23}$$

with the edges given by

$$\lambda_\pm = \frac{1}{2} \pm \frac{\sqrt{2c-1}}{2c}. \tag{7.24}$$

Note that the distribution is symmetric around $\lambda = 1/2$ (see Fig. 7.2).
As $c \to 1$, the edges tend to 0 and 1 and we recover the arcsine law:

$$\rho(\lambda) = \frac{1}{\pi\sqrt{\lambda(1-\lambda)}}. \tag{7.25}$$

Figure 7.2 Density of eigenvalues for a Jacobi matrix in the symmetric case ($c_1 = c_2 = c$) for
$c = 20$, 2 and 1. The case $c = 1$ is the arcsine law.

Exercise 7.1.1 Dirac masses in the Jacobi density


(a) Assuming that $c_1 > 1$ and $c_2 > 1$ show that there are no poles in Eq. (7.18) at
$z = 0$ or $z = 1$ by showing that the numerator vanishes for these two values
of $z$.

(b) The parameters $c_{1,2}$ can be smaller than 1 (as long as they are positive). Show
that in that case $g(z)$ can have poles at $z = 0$ and/or $z = 1$ and find the residue
at these poles.


7.2 Jacobi Matrices and Jacobi Polynomials 
7.2.1 Centered-Range Jacobi Ensemble 


The standard Jacobi matrix defined above has all its eigenvalues between 0 and 1. We would
like to use another definition of the Jacobi matrix with eigenvalues between $-1$ and 1. This
will make the link with orthogonal polynomials easier. We define the centered-range Jacobi
matrix¹

$$\mathbf{J}_c = 2\mathbf{J} - \mathbf{1}. \tag{7.26}$$

This definition is equivalent to

$$\mathbf{J}_c = \mathbf{W}_+^{-1/2}\,\mathbf{W}_-\,\mathbf{W}_+^{-1/2}, \quad \text{where } \mathbf{W}_\pm = \mathbf{H}_1\mathbf{H}_1^T \pm \mathbf{H}_2\mathbf{H}_2^T, \tag{7.27}$$

with $\mathbf{H}_{1,2}$ as above.
The matrix $\mathbf{J}_c$ is still a member of a beta ensemble satisfying Eq. (5.2) with a slightly
modified matrix potential:

$$N V(x) = -(T_1 - N + 1 - 2/\beta)\log(1 + x) - (T_2 - N + 1 - 2/\beta)\log(1 - x). \tag{7.28}$$

In the large $N$ limit, the density of eigenvalues can easily be obtained from (7.20):

$$\rho(\lambda) = \frac{c_+\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi(1 - \lambda^2)}, \tag{7.29}$$

where the edges of the spectrum are given by

$$\lambda_\pm = \frac{c_-(2 - c_+) \pm 4\sqrt{c_1c_2(c_+ - 1)}}{c_+^2}. \tag{7.30}$$

The special case $c_1 = c_2 = 1$ is the centered arcsine law:

$$\rho(\lambda) = \frac{1}{\pi\sqrt{1 - \lambda^2}}. \tag{7.31}$$

1 Note that the matrix $\mathbf{J}_c$ is not necessarily centered in the sense $\tau(\mathbf{J}_c) = 0$ but the potential range of its eigenvalues is $[-1, 1]$,
centered around zero.

7.2.2 Average Expected Characteristic Polynomial


In Section 6.3.3, we saw that the average characteristic polynomial when 6 = 2 is the Nth 
monic polynomial orthogonal to the weight w(x) x exp(—N V (x)). For the centered-range 
Jacobi matrix we have 


w(x) = (1+ OPNA xP, (7.32) 


The Jacobi polynomials pia?) (x) are precisely orthogonal to such weight functions with 
a = Th — N and b = T — N. (Note that unfortunately the standard order of the param- 
eters is inverted with respect to Jacobi matrices.) Jacobi polynomials satisfy the following 
differential equation: 


(1 — xy” +(b-a-(a+b4+2)x)y +n(n+a+b+1)y=0. (7.33) 


This equation has polynomial solutions if and only if n is an integer. The solution is then 
yx PEP (x). 
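The first three Jacobi polynomials are

$$\begin{aligned}
P_0^{(a,b)}(x) &= 1,\\
P_1^{(a,b)}(x) &= \frac{a+b+2}{2}\,x + \frac{a-b}{2},\\
P_2^{(a,b)}(x) &= \frac{(a+b+3)(a+b+4)}{8}\,x^2 + \frac{(a+b+3)(a-b)}{4}\,x + \frac{(a-b)^2-(a+b+4)}{8}.
\end{aligned} \tag{7.34}$$

The normalization of Jacobi polynomials is arbitrary but in the standard normalization
they are not monic. The coefficient of $x^n$ in $P_n^{(a,b)}(x)$ is

$$a_n = \frac{\Gamma[a+b+2n+1]}{2^n\, n!\,\Gamma[a+b+n+1]}. \tag{7.35}$$

In summary we have (for $\beta = 2$)

$$\mathbb{E}\big[\det[z\mathbf{1} - \mathbf{J}_c]\big] = \frac{2^N N!\,\Gamma[T_1+T_2+1-N]}{\Gamma[T_1+T_2+1]}\,P_N^{(T_2-N,\,T_1-N)}(z). \tag{7.36}$$

Note that we must have $T_1 \ge N$ and $T_2 \ge N$.

When $T_1 = T_2 = N$ (i.e. $c_1 = c_2 = 1$, corresponding to the arcsine law), the
polynomials $P_N^{(0,0)}(z)$ are called Legendre polynomials $P_N(z)$:²

$$\mathbb{E}\big[\det[z\mathbf{1} - \mathbf{J}_c]\big] = \frac{2^N (N!)^2}{(2N)!}\,P_N(z). \tag{7.38}$$

2 Legendre polynomials are defined as the polynomial solution of

$$\frac{d}{dx}\left[(1-x^2)\frac{d}{dx}P_n(x)\right] + n(n+1)P_n(x) = 0, \qquad P_n(1) = 1. \tag{7.37}$$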
7.2.3 Maximum Likelihood Configuration at Finite N

In Chapter 6, we studied the most likely configuration of eigenvalues for the beta ensemble
at finite $N$. We saw that the finite-$N$ Stieltjes transform of this solution $g_N(z)$ is related to
a monic polynomial $\psi(x)$ via

$$g_N(z) = \frac{1}{N}\sum_{i=1}^N \frac{1}{z-\lambda_i} = \frac{1}{N}\frac{\psi'(z)}{\psi(z)}, \quad \text{where } \psi(x) = \prod_{i=1}^N (x-\lambda_i). \tag{7.39}$$

The polynomial $\psi(x)$ satisfies Eq. (6.23) which we recall here:

$$\psi''(x) - N V'(x)\,\psi'(x) + N^2\,\Pi_N(x)\,\psi(x) = 0, \tag{7.40}$$

where the function $\Pi_N(x)$ is defined by Eq. (5.32). For the case of the centered-range
Jacobi ensemble we have

$$N V'(x) = \frac{a - b + (a+b+2)x}{1-x^2}, \tag{7.41}$$

where we have anticipated the result by introducing the notation $a = T_2 - N - 2/\beta$ and
$b = T_1 - N - 2/\beta$. The function $\Pi_N(x)$ is given by

$$\Pi_N(x) = \frac{r_N + s_N x}{1-x^2}. \tag{7.42}$$

We will see below that the coefficient $s_N$ is zero because of the symmetry $\{c_1, c_2, \lambda_k\} \to \{c_2, c_1, -\lambda_k\}$ of the most likely solution.
The equation for $\psi(x)$ becomes

$$(1-x^2)\,\psi''(x) + \big(b - a - (a+b+2)x\big)\,\psi'(x) + r_N N^2\,\psi(x) = 0. \tag{7.43}$$

We recognize the differential equation satisfied by the Jacobi polynomials (7.33). Its solutions
are polynomials only if $r_N N = N + a + b + 1$, which implies that $r_N = c_1 + c_2 - 1 + (1 - 4/\beta)/N$.
This is consistent with the large $N$ limit $r = c_1 + c_2 - 1$ in Section 7.1.3.
The solutions are given by

$$\psi(x) \propto P_N^{(a,b)}(x), \tag{7.44}$$

with the proportionality constant chosen such that $\psi(x)$ is monic.

In the special case $T_1 = T_2 = N + 2/\beta$, i.e. $T = N + 2$ for real symmetric matrices
and $T = N + 1$ for complex Hermitian matrices, we have $a = b = 0$ and the polynomials
reduce to Legendre polynomials.

Another special case corresponds to Chebyshev polynomials:

$$T_n(x) = \frac{P_n^{(-\frac12,-\frac12)}(x)}{P_n^{(-\frac12,-\frac12)}(1)} \quad \text{and} \quad U_n(x) = (n+1)\,\frac{P_n^{(\frac12,\frac12)}(x)}{P_n^{(\frac12,\frac12)}(1)}, \tag{7.45}$$

where $T_n(x)$ and $U_n(x)$ are the Chebyshev polynomials of first and second kind respectively.
Since $a = T_2 - N - 2/\beta$ and $b = T_1 - N - 2/\beta$, they appear as solutions for
$T_1 = T_2 = N + 2/\beta - 1/2$ (first kind) or $T_1 = T_2 = N + 2/\beta + 1/2$ (second kind).
These values of $T_1 = T_2$ are not integers but we can still consider the matrix potential
given by Eq. (7.28) without the explicit Wishart matrix construction. In the large $N$ limit,
the density of the zeros of the Jacobi polynomials $P_N^{(T_2-N,\,T_1-N)}(x)$ is given by Eq. (7.29)
with $c_{1,2} = T_{1,2}/N$. Figure 7.3 shows a histogram of the zeros of $P_{200}^{(200,600)}(x)$.

Figure 7.3 Histogram of the 200 zeros of $P_{200}^{(200,600)}(x)$. The full line is the Jacobi eigenvalue density,
Eq. (7.29), with $c_1 = 4$ and $c_2 = 2$.

When $c_1 \to 1$ and $c_2 \to 1$, the density becomes the centered arcsine law. As a consequence,
we have shown that the zeros of Chebyshev (both kinds) and Legendre polynomials
(for which $T_1 = T_2 = N + O(1)$) are distributed according to the centered arcsine law in
the large $N$ limit.
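The statement about the zeros of $P_{200}^{(200,600)}(x)$ can be checked numerically. The sketch below is not taken from the text: it computes the zeros as eigenvalues of the tridiagonal matrix built from the standard three-term recurrence of Jacobi polynomials for the weight $(1-x)^a(1+x)^b$, and compares them with the density (7.29). The helper names (`jacobi_zeros`, `jacobi_edges`, `jacobi_density`) are hypothetical, and the edges are written as $[(c_1-c_2)(c_+-2) \pm 4\sqrt{c_1c_2(c_+-1)}]/c_+^2$, which matches Eq. (7.30) under the assumption that $c_-$ stands for $c_2 - c_1$ there.

```python
import numpy as np

def jacobi_zeros(n, a, b):
    """Zeros of P_n^{(a,b)} as eigenvalues of the symmetric tridiagonal matrix
    of the three-term recurrence for the weight (1-x)^a (1+x)^b.
    Assumes a + b > 0 so that no coefficient below is of the form 0/0."""
    k = np.arange(n, dtype=float)
    s = 2.0 * k + a + b
    alpha = (b**2 - a**2) / (s * (s + 2.0))               # diagonal entries, k = 0..n-1
    k1, s1 = k[1:], s[1:]
    beta = (4.0 * k1 * (k1 + a) * (k1 + b) * (k1 + a + b)
            / (s1**2 * (s1 + 1.0) * (s1 - 1.0)))          # squared off-diagonal entries
    off = np.sqrt(beta)
    return np.linalg.eigvalsh(np.diag(alpha) + np.diag(off, 1) + np.diag(off, -1))

def jacobi_edges(c1, c2):
    """Edges of the centered-range Jacobi density (assumed sign convention, see lead-in)."""
    cp = c1 + c2
    center = (c1 - c2) * (cp - 2.0) / cp**2
    half_width = 4.0 * np.sqrt(c1 * c2 * (cp - 1.0)) / cp**2
    return center - half_width, center + half_width

def jacobi_density(x, c1, c2):
    """Density (7.29); zero outside the edges."""
    lo, hi = jacobi_edges(c1, c2)
    inside = np.clip((hi - x) * (x - lo), 0.0, None)
    return (c1 + c2) * np.sqrt(inside) / (2.0 * np.pi * (1.0 - x**2))

N, T1, T2 = 200, 800, 400                  # c1 = 4, c2 = 2 as in Fig. 7.3
zeros = jacobi_zeros(N, a=T2 - N, b=T1 - N)
lo, hi = jacobi_edges(T1 / N, T2 / N)
print("extreme zeros:", zeros.min(), zeros.max())
print("density edges:", lo, hi)
```

A normalized histogram of `zeros` plotted against `jacobi_density` should reproduce the qualitative picture described for Fig. 7.3; at finite $N$ the extreme zeros sit slightly inside the asymptotic edges.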


We have seen that in order for Eq. (7.40) to have polynomial solutions we must have

$$N(1-x^2)\,\Pi_N(x) = N + a + b + 1, \tag{7.46}$$

where the function $\Pi_N(x)$ is defined from the most likely configuration or, equivalently,
the roots of the Jacobi polynomial $P_N^{(a,b)}(z)$ by

$$\Pi_N(x) = \frac{1}{N}\sum_{k=1}^N \frac{V'(x) - V'(\lambda_k)}{x - \lambda_k}. \tag{7.47}$$

From these expressions we can find a relationship that roots of Jacobi polynomials must
satisfy. Indeed, injecting Eq. (7.41), we find

$$N(1-x^2)\,\Pi_N(x) = \frac{1}{N}\sum_{k=1}^N \frac{(a-b)(x^2-\lambda_k^2) + (a+b+2)(x-\lambda_k)(x\lambda_k+1)}{(1-\lambda_k^2)(x-\lambda_k)}. \tag{7.48}$$

For each $k$ the numerator is a second degree polynomial in $x$ that is zero at $x = \lambda_k$,
canceling the $x - \lambda_k$ factor in the denominator, so the whole expression is a first degree
polynomial. Equating this expression to Eq. (7.46), we find that the term linear in $x$ of
this polynomial must be zero and the constant term equal to $N + a + b + 1$, yielding two
equations:

$$\frac{1}{N}\sum_{k=1}^N \frac{(a+b+2)\lambda_k + (a-b)}{1-\lambda_k^2} = 0, \tag{7.49}$$

$$\frac{1}{N}\sum_{k=1}^N \frac{(a+b+2) + (a-b)\lambda_k}{1-\lambda_k^2} = N + a + b + 1. \tag{7.50}$$

These equations give us non-trivial relations satisfied by the roots of Jacobi polynomials
$P_N^{(a,b)}(x)$. If we sum or subtract the two equations above, we finally obtain³

$$\frac{1}{N}\sum_{k=1}^N \frac{1}{1-\lambda_k} = \frac{a+b+N+1}{2(a+1)}, \tag{7.51}$$

$$\frac{1}{N}\sum_{k=1}^N \frac{1}{1+\lambda_k} = \frac{a+b+N+1}{2(b+1)}. \tag{7.52}$$

7.2.4 Discrete Laplacian in One Dimension and Chebyshev Polynomials 


Chebyshev polynomials and the arcsine law are also related via a simple deterministic
matrix: the discrete Laplacian in one dimension, defined as

$$\Delta = \frac{1}{2}\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 & 0\\
-1 & 2 & -1 & \cdots & 0 & 0\\
0 & -1 & 2 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 2 & -1\\
0 & 0 & 0 & \cdots & -1 & 2
\end{pmatrix}. \tag{7.53}$$

We will see that the spectrum of $\Delta - \mathbf{1}$ at large $N$ is again given by the arcsine law. One way
to obtain this result is to modify the top-right and bottom-left corner elements by adding
$-1$. The modified matrix is then a circulant matrix which can be diagonalized exactly (see
Exercise 7.2.1 and Appendix A.3).

We will use a different route which will also uncover the link to Chebyshev polynomials. 
We will compute the characteristic polynomial $Q_N(z)$ of $\Delta - \mathbf{1}$ for all values of $N$ by
induction. The first two are

$$Q_1(z) = z, \tag{7.54}$$

$$Q_2(z) = z^2 - \frac{1}{4}. \tag{7.55}$$

3 These relations were recently obtained in Alıcı and Taşeli [2015].

For $N \ge 3$ we can write a recursion relation by expanding in minors the first line of
the determinant of $z\mathbf{1} - \Delta + \mathbf{1}$. The (11)-minor is just $Q_{N-1}(z)$. The first column of the
(12)-minor only has one element equal to 1/2; if we expand this column its only minor is
$Q_{N-2}(z)$. We find

$$Q_N(z) = z\,Q_{N-1}(z) - \frac{1}{4}\,Q_{N-2}(z). \tag{7.56}$$

This simple recursion relation is similar to that of Chebyshev polynomials $U_N(x)$:

$$U_N(x) = 2x\,U_{N-1}(x) - U_{N-2}(x). \tag{7.57}$$

The standard Chebyshev polynomials are not monic but have leading term $U_N(x) \sim 2^N x^N$.
Monic Chebyshev polynomials ($\widetilde{U}_N(x) = 2^{-N} U_N(x)$) in fact precisely satisfy Eq. (7.56). Given that our
first two polynomials are the monic Chebyshev polynomials of the second kind, we conclude that
$Q_N(z) = \widetilde{U}_N(z)$ for all $N$. The eigenvalues of $\Delta - \mathbf{1}$ at size $N$ are therefore given
by the zeros of the $N$th Chebyshev polynomial of the second kind. In the large $N$ limit
those are distributed according to the centered arcsine law. QED.

We will see in Section 15.3.1 that the sum of two random symmetric orthogonal matrices 
also has eigenvalues distributed according to the arcsine law. 
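The link between the discrete Laplacian and Chebyshev polynomials can be verified directly. The sketch below (an illustration with hypothetical variable names) builds $\Delta - \mathbf{1}$, compares its eigenvalues with the known zeros $\cos(k\pi/(N+1))$ of $U_N$, and checks for a small size that the recursion (7.56), started from $Q_0 = 1$ and $Q_1 = z$, reproduces the characteristic polynomial returned by NumPy.

```python
import numpy as np

N = 50
# Delta - 1 from Eq. (7.53): zero diagonal, -1/2 on the two first off-diagonals
M = -0.5 * (np.eye(N, k=1) + np.eye(N, k=-1))
eigs = np.linalg.eigvalsh(M)                                         # ascending order
cheb_zeros = np.sort(np.cos(np.arange(1, N + 1) * np.pi / (N + 1)))  # zeros of U_N
max_diff = np.max(np.abs(eigs - cheb_zeros))
print("largest eigenvalue mismatch:", max_diff)

def Q(n):
    """Coefficients (highest degree first) of Q_n built from the recursion (7.56)."""
    q_prev, q = np.array([1.0]), np.array([1.0, 0.0])   # Q_0 = 1, Q_1 = z
    if n == 0:
        return q_prev
    for _ in range(2, n + 1):
        q_prev, q = q, np.polysub(np.polymul([1.0, 0.0], q), 0.25 * q_prev)
    return q

n_small = 6
M_small = -0.5 * (np.eye(n_small, k=1) + np.eye(n_small, k=-1))
char_poly = np.poly(M_small)        # coefficients of det(z 1 - M_small)
print(Q(n_small))
print(char_poly)
```

Since the zeros are cosines of equally spaced angles, the fraction of eigenvalues below $x$ approaches $1/2 + \arcsin(x)/\pi$, which is the cumulative distribution of the centered arcsine law.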


Exercise 7.2.1 Diagonalizing the Discrete Laplacian 
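Consider $\mathbf{M} = \Delta - \mathbf{1}$ with $-1/2$ added to the top-right and bottom-left
corners.

(a) Show that the vectors $[\mathbf{v}_k]_j = e^{i2\pi k j}$ (with $kN$ an integer, so that $\mathbf{v}_k$ is periodic) are eigenvectors of $\mathbf{M}$ with eigenvalues
$\lambda_k = -\cos(2\pi k)$.

(b) Show that the eigenvalue density of $\mathbf{M}$ in the large $N$ limit is given by the
centered arcsine law (7.31).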
Sums and Products of Random Matrices


8 Addition of Random Variables and Brownian Motion


In the following chapters we will be interested in the properties of sums (and products) 
of random matrices. Before embarking on this relatively new field, the present chapter 
will quickly review some classical results concerning sums of random scalars, and the 
corresponding continuous time limit that leads to the Brownian motion and stochastic 
calculus. 


8.1 Sums of Random Variables 


Let us thus consider $X = X_1 + X_2$ where $X_1$ and $X_2$ are two random variables, independent,
and distributed according to, respectively, $P_1(x_1)$ and $P_2(x_2)$. The probability that $X$ is
equal to $x$ (to within $dx$) is given by the sum over all combinations of $x_1$ and $x_2$ such
that $x_1 + x_2 = x$, weighted by their respective probabilities. The variables $X_1$ and $X_2$
being independent, the joint probability that $X_1 = x_1$ and $X_2 = x - x_1$ is equal to
$P_1(x_1)P_2(x - x_1)$, from which one obtains

$$P^{(2)}(x) = \int P_1(x')\,P_2(x - x')\,dx'. \tag{8.1}$$

This equation defines the convolution between $P_1$ and $P_2$, which we will write $P^{(2)} = P_1 \star P_2$.
The generalization to the sum of $N$ independent random variables is immediate. If
$X = X_1 + X_2 + \cdots + X_N$ with $X_i$ distributed according to $P_i(x_i)$, the distribution of $X$ is
obtained as

$$P^{(N)}(x) = \int \prod_{i=1}^N dx_i\; P_1(x_1)P_2(x_2)\cdots P_N(x_N)\;\delta\!\left(x - \sum_{i=1}^N x_i\right), \tag{8.2}$$

where $\delta(\cdot)$ is the Dirac delta function. The analytical or numerical manipulations of Eqs.
(8.1) and (8.2) are much eased by the use of Fourier transforms, for which convolutions
become simple products. The equation $P^{(2)}(x) = [P_1 \star P_2](x)$ reads, in Fourier space,

$$\varphi^{(2)}(k) = \int e^{ik(x - x' + x')} \int P_1(x')\,P_2(x - x')\,dx'\,dx = \varphi_1(k)\,\varphi_2(k), \tag{8.3}$$

where $\varphi(k)$ denotes the Fourier transform of the corresponding probability density $P(x)$. It
is often called its characteristic (or generating) function. Since the characteristic functions
multiply, their logarithms add, i.e. the function $H(k)$ defined below is additive:

$$H(k) := \log\varphi(k) = \log\mathbb{E}[e^{ikX}]. \tag{8.4}$$

It allows one to recover its so-called cumulants $c_n$ (provided they are finite) through

$$c_n := (-i)^n \left.\frac{d^n}{dk^n}H(k)\right|_{k=0}. \tag{8.5}$$

The cumulants $c_n$ are polynomial combinations of the moments $m_p$ with $p \le n$. For
example $c_1 = m_1$ is the mean of the distribution and $c_2 = m_2 - m_1^2 = \sigma^2$ its variance. It
is clear that the mean of the sum of two random variables (independent or not) is equal to
the sum of the individual means. The mean is thus additive under convolution. The same is
true for the variance, but only for independent variables.

More generally, from the additive property of $H(k)$ all the cumulants of two independent
distributions simply add. The additivity of cumulants is a consequence of the linearity
of differentiation. The cumulants of a given law convolved $N$ times with itself thus follow
the simple rule $c_{n,N} = N c_{n,1}$, where the $\{c_{n,1}\}$ are the cumulants of the elementary
distribution $P_1$.
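As a small numerical illustration of these statements (a sketch with made-up discrete laws, not an example from the text), one can convolve two distributions on the integers and check that the characteristic functions multiply and that the first four cumulants add. The helper names `cumulants` and `char_fn` are hypothetical.

```python
import numpy as np

def cumulants(p):
    """First four cumulants of a law p[n] = P(X = n) on the integers 0, 1, 2, ..."""
    x = np.arange(len(p))
    mean = np.sum(p * x)
    mu2, mu3, mu4 = (np.sum(p * (x - mean) ** m) for m in (2, 3, 4))
    return np.array([mean, mu2, mu3, mu4 - 3.0 * mu2**2])

def char_fn(p, k):
    """Characteristic function E[exp(ikX)] of the discrete law p."""
    x = np.arange(len(p))
    return np.sum(p * np.exp(1j * k * x))

# two arbitrary (hypothetical) laws, used only for illustration
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.1, 0.1, 0.2, 0.6])
p_sum = np.convolve(p1, p2)          # law of X1 + X2: discrete analogue of Eq. (8.1)

print(cumulants(p_sum))
print(cumulants(p1) + cumulants(p2))
print(char_fn(p_sum, 0.7), char_fn(p1, 0.7) * char_fn(p2, 0.7))
```

The fourth cumulant is computed as the fourth central moment minus three times the squared variance; unlike the fourth moment itself, this combination is additive.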

An important case is when $P_1$ is a Gaussian distribution,

$$P_1(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-m)^2}{2\sigma^2}}, \tag{8.6}$$

such that $\log\varphi_1(k) = imk - \sigma^2k^2/2$. The Gaussian distribution is such that all cumulants
of order $\ge 3$ are zero. This property is clearly preserved under convolution: the sum of
Gaussian random variables remains Gaussian. Conversely, one can always write a Gaussian
variable as a sum of an arbitrary number of Gaussian variables: Gaussian variables are
infinitely divisible.

In the following, we will consider infinitesimal Gaussian variables, noted $dB$, such that
$\mathbb{E}[dB] = 0$ and $\mathbb{E}[dB^2] = dt$, where $dt \to 0$ is an infinitesimal quantity, which we will
interpret as an infinitesimal time increment. In other words, $dB$ is a mean zero Gaussian
random variable which has fluctuations of order $\sqrt{dt}$.


8.2 Stochastic Calculus 
8.2.1 Brownian Motion 


The starting point of stochastic calculus is the Brownian motion (also called Wiener process)
$X_t$, which is a Gaussian random variable of mean $\mu t$ and variance $\sigma^2 t$. From the
infinite divisibility property of Gaussian variables, one can always write

$$X_{t_k} = \sum_{\ell=0}^{k-1}\mu\,\delta t + \sum_{\ell=0}^{k-1}\sigma\,\delta B_\ell, \tag{8.7}$$

where $t_k = kT/N$, $0 \le k \le N$, $\delta t = T/N$ and $\delta B_\ell \sim \mathcal{N}(0,\delta t)$ for each $\ell$. By construction
$X(t_N) = X_T$. In the limit $N \to \infty$, we have $\delta t \to dt$, $\delta B_\ell \to dB_t$ and $X_t$ becomes a
continuous time process with

$$dX_t = \mu\,dt + \sigma\,dB_t, \qquad X_0 = 0, \tag{8.8}$$

where $dB_t$ are independent, infinitesimal Gaussian variables as defined above (see Fig. 8.1).
The process $X_t$ is continuous but nowhere differentiable. Note that $X_t$ and $X_{t'}$ are not
independent but their increments are, i.e. $X_{t'}$ and $X_t - X_{t'}$ are independent whenever $t' < t$.
Note the convention that $X_{t_k}$ is built from past increments $\delta B_\ell$ for $\ell < k$ but does not
include $\delta B_k$. This convention is called the Itô prescription.¹ Its main advantage is that $X_t$
is independent of the equal-time $dB_t$, but this comes at a price: the usual chain rule for
differentiation has to be corrected by the so-called Itô term, which we now discuss.

Figure 8.1 An example of Brownian motion.


8.2.2 Itô's Lemma

We now study the behavior of functions $F(X_t)$ of a Wiener process $X_t$. Because $dB^2$ is of
order $dt$, and not $dt^2$, one has to be careful when evaluating derivatives of functions of $X_t$.

Given a twice differentiable function $F(\cdot)$, we consider the process $F(X_t)$. Reverting
for a moment to a discretized version of the process, one has

$$F(X(t+\delta t)) = F(X_t) + \delta X\,F'(X_t) + \frac{(\delta X)^2}{2}\,F''(X_t) + o(\delta t), \tag{8.9}$$

1 In the Stratonovich prescription, half of $\delta B_k$ contributes to $X_{t_k}$. In this prescription, the Itô lemma is not needed, i.e. the chain
rule applies without any correction term, but the price to pay is a correlation between $X_t$ and $dB_t$. We will not use the
Stratonovich prescription in this book.

where

$$\delta X = \mu\,\delta t + \sigma\,\delta B, \tag{8.10}$$

and

$$(\delta X)^2 = \mu^2(\delta t)^2 + \sigma^2\delta t + \sigma^2\left[(\delta B)^2 - \delta t\right] + 2\mu\sigma\,\delta t\,\delta B. \tag{8.11}$$

The random variable $(\delta B)^2$ has mean $\delta t$ so the first and last terms are clearly $o(\delta t)$ when
$\delta t \to 0$. The third term has standard deviation given by $\sqrt{2}\,\sigma^2\delta t$; so the third term is also
of order $\delta t$ but of zero mean. It is thus a random term much like $\delta B$, but much smaller since
$\delta B$ is of order $\sqrt{\delta t} \gg \delta t$. The Itô lemma is a precise mathematical statement that justifies
why this term can be neglected to first order in $\delta t$. Hence, letting $\delta t \to dt$, we get

$$dF_t = \frac{\partial F}{\partial X}\,dX_t + \frac{\sigma^2}{2}\frac{\partial^2 F}{\partial X^2}\,dt. \tag{8.12}$$

When compared to ordinary calculus, there is a correction term (the Itô term) that depends
on the second order derivative of $F$.

More generally, we can consider a general Itô process where $\mu$ and $\sigma$ themselves depend
on $X_t$ and $t$, i.e.

$$dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\,dB_t. \tag{8.13}$$

Then, for functions $F(X,t)$ that may have an explicit time dependence, one has

$$dF_t = \frac{\partial F}{\partial X}\,dX_t + \left[\frac{\partial F}{\partial t} + \frac{\sigma^2(X_t,t)}{2}\frac{\partial^2 F}{\partial X^2}\right]dt. \tag{8.14}$$

Itô's lemma can be extended to functions of several stochastic variables. Consider a
collection of $N$ independent stochastic variables $\{X_{i,t}\}$ (written vectorially as $\mathbf{X}_t$) and
such that

$$dX_{i,t} = \mu_i(\mathbf{X}_t,t)\,dt + dW_{i,t}, \tag{8.15}$$

where $dW_{i,t}$ are Wiener noises such that

$$\mathbb{E}\left[dW_{i,t}\,dW_{j,t}\right] = C_{ij}(\mathbf{X}_t,t)\,dt. \tag{8.16}$$

The vectorial form of Itô's lemma states that the time evolution of a function $F(\mathbf{X}_t,t)$ is
given by the sum of three contributions:

$$dF_t = \sum_{i=1}^N \frac{\partial F}{\partial X_i}\,dX_{i,t} + \left[\frac{\partial F}{\partial t} + \sum_{i,j=1}^N \frac{C_{ij}(\mathbf{X}_t,t)}{2}\frac{\partial^2 F}{\partial X_i\,\partial X_j}\right]dt. \tag{8.17}$$

The formula simplifies when all the Wiener noises are independent, in which case the Itô
term only contains the second derivatives $\partial^2 F/\partial X_i^2$.




8.2.3 Variance as a Function of Time 


As an illustration of how to use Itô's formula let us recompute the time dependent variance
of $X_t$. Assume $\mu = 0$ and choose $F(X) = X^2$. Applying Eq. (8.12), we get that

$$dF_t = 2X_t\,dX_t + \sigma^2dt \quad\Rightarrow\quad F(X_t) = 2\int_0^t \sigma X_s\,dB_s + \sigma^2 t. \tag{8.18}$$

In order to take the expectation value of this equation, we scrutinize the term $\mathbb{E}[X_s\,dB_s]$. As
alluded to above, the random infinitesimal element $dB_s$ does not contribute to $X_s$, which
only depends on $dB_{s'<s}$. Therefore $\mathbb{E}[X_s\,dB_s] = 0$, and, as expected,

$$\mathbb{E}[X_t^2] = \mathbb{E}[F(X_t)] = \sigma^2 t. \tag{8.19}$$

The Brownian motion has a variance from the origin that grows linearly with time. The
same result can of course be derived directly from the integrated form $X_t = \sigma B_t$, where $B_t$
is a Gaussian random number of variance equal to $t$.
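The role of the Itô prescription in this computation can be illustrated with a short simulation. The sketch below (parameter values and variable names are arbitrary choices, not from the text) discretizes a driftless Brownian motion, builds the stochastic integral $\int_0^T X_s\,dB_s$ using the value of $X$ before each increment, and compares both sides of the integrated relation $X_T^2 = 2\sigma\int_0^T X_s\,dB_s + \sigma^2 T$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_steps, n_paths = 0.7, 2.0, 100, 10_000
dt = T / n_steps

dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))    # increments ~ N(0, dt)
X = sigma * np.cumsum(dB, axis=1)                             # X at times t_1, ..., t_n
# Ito prescription: the integrand is X *before* the increment it multiplies
X_before = np.hstack([np.zeros((n_paths, 1)), X[:, :-1]])
ito_integral = np.sum(X_before * dB, axis=1)                  # int_0^T X_s dB_s

X_T_squared = X[:, -1] ** 2
reconstructed = 2.0 * sigma * ito_integral + sigma**2 * T

print("E[X_T^2]        ~", X_T_squared.mean(), " (sigma^2 T =", sigma**2 * T, ")")
print("E[int X dB]     ~", ito_integral.mean())
print("pathwise spread ~", np.std(X_T_squared - reconstructed))
```

The sample mean of the stochastic integral is close to zero, as the independence of $X_s$ and $dB_s$ requires, and the pathwise difference between the two sides shrinks as the number of steps grows, since it comes entirely from the fluctuating part of $\sum(\delta B)^2$.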


8.2.4 Gaussian Addition 


Itô's lemma can be used to compute a special case of the law of addition of independent
random variables, namely when one of the variables is Gaussian. Consider the random
variable $Z = Y + X$, where $Y$ is some random variable, and $X$ is an independent Gaussian
($X \sim \mathcal{N}(\mu, \sigma^2)$). The law of $Z$ is uniquely determined by its characteristic function:

$$\varphi(k) := \mathbb{E}[e^{ikZ}]. \tag{8.20}$$

We now let $Z \to Z_t$ be a Brownian motion with $Z_0 = Y$:

$$dZ_t = \mu\,dt + \sigma\,dB_t, \qquad Z_0 = Y. \tag{8.21}$$

Note that $Z_{t=1}$ has the same law as $Z$. The idea is now to study the function $F(Z_t) := e^{ikZ_t}$
using Itô's lemma, Eq. (8.14). Hence,

$$dF_t = ik\,e^{ikZ_t}\,dZ_t - \frac{k^2\sigma^2}{2}\,e^{ikZ_t}\,dt = \left[ik\mu F - \frac{k^2\sigma^2}{2}F\right]dt + ik\sigma F\,dB_t. \tag{8.22}$$

Taking the expectation value, writing $\varphi_t(k) = \mathbb{E}[F(Z_t)]$, and noting that the differential $d$ is
a linear operator and therefore commutes with the expectation value, we obtain

$$d\varphi_t(k) = \left(ik\mu - \frac{k^2\sigma^2}{2}\right)\varphi_t(k)\,dt, \tag{8.23}$$

or

$$\frac{1}{\varphi_t(k)}\frac{d\varphi_t(k)}{dt} = \frac{d}{dt}\log\left(\varphi_t(k)\right) = ik\mu - \frac{k^2\sigma^2}{2}. \tag{8.24}$$

From its solution at $t = 1$, we get

$$\log\left(\varphi_1(k)\right) = \log\left(\varphi_0(k)\right) + ik\mu - \frac{k^2\sigma^2}{2}. \tag{8.25}$$

Recognizing the last two terms in the right hand side as the logarithm of the characteristic function of a
Gaussian variable, we recover the fact that the log-characteristic function is additive under
the addition of independent random variables. Although the result is true in general, the
calculation above using stochastic calculus is only valid if one of the random variables is
Gaussian.
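The additivity of the log-characteristic function can be checked by sampling. In the sketch below, $Y$ is taken uniform on $[-1, 1]$ (an arbitrary choice made for illustration, for which $\varphi_0(k) = \sin(k)/k$), $X$ is an independent Gaussian, and the empirical characteristic function of $Z = Y + X$ is compared with the prediction of Eq. (8.25). All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.3, 0.8, 200_000

Y = rng.uniform(-1.0, 1.0, size=n)           # illustrative choice of the law of Y
Z = Y + rng.normal(mu, sigma, size=n)        # Z = Y + X with X ~ N(mu, sigma^2)

def log_phi_Y(k):
    """log-characteristic function of the uniform law on [-1, 1], valid for 0 < k < pi."""
    return np.log(np.sin(k) / k)

k_values = np.array([0.5, 1.0, 2.0])
phi_Z_empirical = np.array([np.mean(np.exp(1j * k * Z)) for k in k_values])
# Eq. (8.25): log phi_1 = log phi_0 + i k mu - k^2 sigma^2 / 2
phi_Z_predicted = np.exp(log_phi_Y(k_values) + 1j * k_values * mu
                         - 0.5 * k_values**2 * sigma**2)

for k, emp, pred in zip(k_values, phi_Z_empirical, phi_Z_predicted):
    print(k, emp, pred)
```

The agreement is limited only by the sampling error, of order $1/\sqrt{n}$.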


8.2.5 The Langevin Equation 


We would like to construct a stochastic process for a variable $X_t$ such that in the steady-state
regime the values of $X_t$ are drawn from a given probability distribution $P(x)$. To build
our stochastic process, let us first consider the simple Brownian motion with unit variance
per unit time:

$$dX_t = dB_t. \tag{8.26}$$

As revealed by Eq. (8.19) the variance of $X_t$ grows linearly with time and the process
never reaches a stationary state. To make it stationary we need a mechanism to limit the
variance of $X_t$. We cannot 'subtract' variance but we can reduce $X_t$ by scaling. If at every
infinitesimal time step we replace $X_{t+dt}$ by $X_{t+dt}/\sqrt{1+dt}$, the variance of $X_t$ will remain
equal to unity. We also know that the distribution of $X_t$ is Gaussian (if the initial condition is
Gaussian or constant). With this extra rescaling, $X_t$ is still Gaussian at every step, so clearly
this will describe the stationary state of our rescaled process. As a stochastic differential
equation, we have, neglecting terms of order $(dt)^{3/2}$,

$$dX_t = dB_t + \frac{X_t}{\sqrt{1+dt}} - X_t = dB_t - \frac{1}{2}X_t\,dt. \tag{8.27}$$

This stationary version of the random walk is the Ornstein–Uhlenbeck process (see
Fig. 8.2). A physical interpretation of this equation is that of a particle located at $X_t$
moving in a viscous medium subjected to random forces $dB_t/dt$ and a deterministic
harmonic force ("spring") $-X_t/2$. The viscous medium is such that velocity (and not
acceleration) is proportional to force.

We would like to generalize the above formalism to generate any distribution $P(x)$ for
the distribution of $X$ in the stationary state. One way to do so is to change the linear force
$-X_t/2$ to a general non-linear force $F(X_t) := -V'(X_t)/2$, where we have written the
force as the derivative of a potential $V$ and introduced a factor of 2 which will prove to be
convenient. If the potential is convex, the force will drive the particle towards the minimum
of the potential while the noise $dB_t$ will drive the particle away. We expect that this system
will reach a steady state. Our stochastic equation is now

$$dX_t = dB_t - \frac{1}{2}V'(X_t)\,dt. \tag{8.28}$$

Figure 8.2 (left) A simulation of the Langevin equation for the Ornstein–Uhlenbeck process (8.27)
with 50 steps per unit time. Note that the correlation time is $\tau_c = 2$ here, so excursions away from
zero typically take 2 time units to mean-revert back to zero. Farther excursions take longer to come
back. (right) Histogram of the values of $X_t$ for the same process simulated up to $t = 2000$ and
comparison with the normal distribution. The agreement varies from sample to sample as a rare far
excursion can affect the sample distribution even for $t = 2000$.

What is the distribution of $X_t$ in the steady state? To find out, let us consider an arbitrary
test function $f(X_t)$ and see how it behaves in the steady state. Using Itô's lemma,
Eq. (8.12), we have

$$df(X_t) = f'(X_t)\left[dB_t - \frac{1}{2}V'(X_t)\,dt\right] + \frac{1}{2}f''(X_t)\,dt. \tag{8.29}$$

Taking the expectation value on both sides and demanding that $d\mathbb{E}[f(X_t)]/dt = 0$ in the steady
state we find

$$\mathbb{E}\left[f'(X)V'(X)\right] = \mathbb{E}\left[f''(X)\right]. \tag{8.30}$$

This must be true for any function $f$. In order to infer the corresponding stationary distribution
$P(x)$, let us write $h(x) = f'(x)$ and write these expectation values as

$$\int h(x)\,V'(x)\,P(x)\,dx = \int h'(x)\,P(x)\,dx. \tag{8.31}$$

Since we want to relate an integral of $h$ to one of $h'$ we should use integration by parts:

$$\int h'(x)\,P(x)\,dx = -\int h(x)\,P'(x)\,dx = -\int h(x)\,\frac{P'(x)}{P(x)}\,P(x)\,dx. \tag{8.32}$$

Since Eq. (8.31) is true for any function $h(x)$ we must have

$$V'(x) = -\frac{P'(x)}{P(x)} \quad\Rightarrow\quad P(x) = Z^{-1}\exp[-V(x)], \tag{8.33}$$

where $Z$ is an integration constant that fixes the normalization of the law $P(x)$.
To recapitulate, given a probability density $P(x)$, we can define a potential $V(x) = -\log P(x)$
(up to an irrelevant additive constant) and consider the stochastic differential
equation

$$dX_t = dB_t - \frac{1}{2}V'(X_t)\,dt. \tag{8.34}$$

The stochastic variable $X_t$ will eventually reach a steady state. In that steady state the law of
$X_t$ will be given by $P(x)$. Equation (8.34) is called the Langevin equation. The strength of
the Langevin equation is that it allows one to replace the average over the probability $P(x)$
by a sample average over time of a stochastic process.² Any rescaling of time $t \to \sigma^2 t$
would yield a Langevin equation with the same stationary state:

$$dX_t = \sigma\,dB_t - \frac{\sigma^2}{2}V'(X_t)\,dt. \tag{8.35}$$

We have learned another useful fact from Eq. (8.30): the random variable $V'(X)$ acts
as a derivative with respect to $X$ under the expectation value. In that sense $V'(X)$ can be
considered the conjugate variable to $X$.

It is very straightforward to generalize our one-dimensional Langevin equation to a set
of $N$ variables $\{X_i\}$ that are drawn from the joint law $P(\mathbf{x}) = Z^{-1}\exp[-V(\mathbf{x})]$. We get

$$dX_i = \sigma\,dB_i + \frac{\sigma^2}{2}\frac{\partial}{\partial X_i}\log P(\mathbf{x})\,dt, \tag{8.36}$$

where we have dropped the subscript $t$ for clarity.
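A minimal simulation of the one-dimensional Langevin equation $dX_t = dB_t - \tfrac12 V'(X_t)\,dt$ can be written with a simple Euler discretization. The sketch below is only an illustration: the function name `langevin_samples`, the step size and the two potentials are assumptions made here. It uses the identity (8.30) with $f'(x) = x$, which gives $\mathbb{E}[X\,V'(X)] = 1$ in the steady state, as a check: $\mathbb{E}[X^2] = 1$ for $V = x^2/2$ and $\mathbb{E}[X^4] = 1$ for $V = x^4/4$.

```python
import numpy as np

def langevin_samples(grad_V, n_paths, t_final, dt, rng, x0=0.0):
    """Euler discretization of dX = dB - V'(X)/2 dt for n_paths independent paths.
    Returns the values of X at time t_final."""
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(int(round(t_final / dt))):
        noise = rng.normal(0.0, np.sqrt(dt), size=n_paths)   # dB ~ N(0, dt)
        x += noise - 0.5 * grad_V(x) * dt
    return x

rng = np.random.default_rng(2)
# Ornstein-Uhlenbeck case, V(x) = x^2/2, stationary law N(0, 1)
x_ou = langevin_samples(lambda x: x, n_paths=4000, t_final=10.0, dt=0.01, rng=rng)
# quartic potential, V(x) = x^4/4, stationary law proportional to exp(-x^4/4)
x_quartic = langevin_samples(lambda x: x**3, n_paths=4000, t_final=10.0, dt=0.01, rng=rng)

print("E[X^2] for V = x^2/2:", np.mean(x_ou**2))
print("E[X^4] for V = x^4/4:", np.mean(x_quartic**4))
```

Here many independent paths are run for a few relaxation times rather than one long path; a single long trajectory sampled over time would serve the same purpose when the process is ergodic. The finite step introduces a small bias of order `dt`.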
Exercise 8.2.1 Langevin equation for Student's t-distributions

The family of Student's t-distributions, parameterized by the tail exponent $\mu$,
is given by the probability density

$$P_\mu(x) = Z_\mu^{-1}\left(1 + \frac{x^2}{\mu}\right)^{-\frac{\mu+1}{2}} \quad \text{with} \quad Z_\mu^{-1} = \frac{\Gamma\!\left(\frac{1+\mu}{2}\right)}{\sqrt{\pi\mu}\;\Gamma\!\left(\frac{\mu}{2}\right)}. \tag{8.37}$$

(a) What is the potential $V(x)$ and its derivative $V'(x)$ for these laws?
(b) Using Eq. (8.30), show that for a t-distributed variable $x$ we have

$$\mathbb{E}\left[\frac{x^2}{x^2 + \mu}\right] = \frac{1}{1+\mu}. \tag{8.38}$$

2 A process for which the time evolution samples the entire set of possible values according to the stationary probability is
called ergodic. A discussion of the condition for ergodicity is beyond the scope of this book.

(c) Write the Langevin equation for a Student's t-distribution. What is the
$\mu \to \infty$ limit of this equation?

(d) Simulate your Langevin equation for $\mu = 3$, 20 time steps per unit time and
run the simulation for 20 000 units of time. Make a normalized histogram of
the sampled values of $X_t$ and compare with the law for $\mu = 3$ given above.

(e) Compared to the Gaussian process (Ornstein–Uhlenbeck), the Student
t-process has many more short excursions but the long excursions are much
longer than the Gaussian ones. Explain this behavior by comparing the
function $V'(x)$ in the two cases. Describe their relative small $|x|$ and large
$|x|$ behavior.


8.2.6 The Fokker–Planck Equation

It is interesting to derive, from the Langevin equation (8.28), the so-called Fokker–Planck
equation that describes the dynamical evolution of the time dependent probability
density $P(x,t)$ of the random variable $X_t$. The trick is to use Eq. (8.29) again, with $f(x)$
an arbitrary test function. Taking expectations, one finds

$$d\,\mathbb{E}[f(X_t)] = \mathbb{E}\left[f'(X_t)F(X_t)\,dt\right] + \frac{1}{2}\mathbb{E}\left[f''(X_t)\,dt\right], \tag{8.39}$$

where we have used the fact that, in the Itô convention, $\mathbb{E}[f'(X_t)\,dB_t] = 0$. But by definition
of $P(x,t)$, one also has

$$\mathbb{E}[f(X_t)] = \int f(x)\,P(x,t)\,dx. \tag{8.40}$$

Hence,

$$\int f(x)\,\frac{\partial P(x,t)}{\partial t}\,dx = \int f'(x)\,F(x)\,P(x,t)\,dx + \frac{1}{2}\int f''(x)\,P(x,t)\,dx. \tag{8.41}$$

Integrating by parts the right-hand side leads to

$$\int f(x)\,\frac{\partial P(x,t)}{\partial t}\,dx = -\int f(x)\,\frac{\partial\,[F(x)P(x,t)]}{\partial x}\,dx + \frac{1}{2}\int f(x)\,\frac{\partial^2 P(x,t)}{\partial x^2}\,dx. \tag{8.42}$$

Since this equation holds for an arbitrary test function $f(x)$, it must be that

$$\frac{\partial P(x,t)}{\partial t} = -\frac{\partial\,[F(x)P(x,t)]}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2 P(x,t)}{\partial x^2}, \tag{8.43}$$

which is called the Fokker–Planck equation. We have reintroduced an arbitrary value
of $\sigma$ here, to make the equation more general. One can easily check that when $F(x) = -V'(x)/2$,
the stationary state of this equation, such that the left hand side is zero, is

$$P(x) = Z^{-1}\exp\left[-V(x)/\sigma^2\right], \tag{8.44}$$

as expected from the previous section.
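The relaxation towards the stationary state can also be observed by integrating the Fokker–Planck equation on a grid. The sketch below uses a plain explicit finite-difference scheme (a choice made here for brevity, not a method from the text) with the harmonic potential $V(x) = x^2/2$, $F = -V'/2$ and $\sigma = 0.8$; grid size, time step and initial condition are arbitrary illustrative values.

```python
import numpy as np

sigma = 0.8
x = np.linspace(-6.0, 6.0, 241)
dx = x[1] - x[0]
dt = 0.001                                   # small enough for the explicit scheme to be stable
F = -0.5 * x                                 # F = -V'/2 with V(x) = x^2/2

# initial condition: a narrow bump away from the origin, normalized on the grid
P = np.exp(-0.5 * ((x - 1.5) / 0.3) ** 2)
P /= np.sum(P) * dx

for _ in range(20000):                       # evolve up to t = 20
    flux = F * P
    dP = np.zeros_like(P)
    # dP/dt = -d(F P)/dx + (sigma^2 / 2) d^2 P / dx^2, central differences in the interior
    dP[1:-1] = (-(flux[2:] - flux[:-2]) / (2.0 * dx)
                + 0.5 * sigma**2 * (P[2:] - 2.0 * P[1:-1] + P[:-2]) / dx**2)
    P += dt * dP

P_stat = np.exp(-0.5 * x**2 / sigma**2)      # exp(-V / sigma^2)
P_stat /= np.sum(P_stat) * dx
max_err = np.max(np.abs(P - P_stat))
print("total mass:", np.sum(P) * dx, " max deviation from stationary law:", max_err)
```

The final profile is a Gaussian of variance $\sigma^2$, i.e. $\exp[-V(x)/\sigma^2]$ up to normalization, with a residual error set by the grid spacing.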
9 Dyson Brownian Motion


In this chapter we would like to start our investigation of the addition of random matrices, 
which will lead to the theory of so-called “free matrices”. This topic has attracted substan- 
tial interest in recent years and will be covered in the next chapters. 

We will start by studying how a fixed large matrix (random or not) is modified when 
one adds a Wigner matrix. The elements of a Wigner matrix are Gaussian random numbers 
and, as we saw in the previous chapter, each of them can be written as a sum of Gaussian 
numbers with smaller variance. By pushing this reasoning to the limit we can write the 
addition of a Wigner matrix as a continuous process of addition of infinitesimal Wigner 
matrices. Such a matrix Brownian motion process, viewed through the lens of eigenvalues 
and eigenvectors, is called a Dyson Brownian motion (DBM) after the physicist Freeman 
Dyson who first introduced it in 1962. 


9.1 Dyson Brownian Motion I: Perturbation Theory 
9.1.1 Perturbation Theory: A Short Primer 


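We begin by recalling how perturbation theory works for eigenvalues and eigenvectors.
This is a standard topic in elementary quantum mechanics, but is not necessarily well
known in other circles. Let us consider a matrix $\mathbf{H} = \mathbf{H}_0 + \epsilon\mathbf{H}_1$, where $\mathbf{H}_0$ is a real
symmetric matrix whose eigenvalues and eigenvectors are assumed to be known, and $\mathbf{H}_1$ is
a real symmetric matrix that gives the perturbation, with $\epsilon$ a small parameter. (In quantum
mechanics, $\mathbf{H}_0$ and $\mathbf{H}_1$ are complex Hermitian, but the final equations below are the same
in both cases.)

Suppose $\lambda_{i,0}$, $1 \le i \le N$, are the eigenvalues of $\mathbf{H}_0$ and $\mathbf{v}_{i,0}$, $1 \le i \le N$, are the
corresponding eigenvectors. We assume that the perturbed eigenvalues and eigenvectors
are given by the series expansion (in $\epsilon$):

$$\lambda_i = \lambda_{i,0} + \sum_{k=1}^\infty \epsilon^k\lambda_{i,k}, \qquad \mathbf{v}_i = \mathbf{v}_{i,0} + \sum_{k=1}^\infty \epsilon^k\mathbf{v}_{i,k}, \tag{9.1}$$

with the constraint that

$$\|\mathbf{v}_i\| = \|\mathbf{v}_{i,0}\| = 1, \qquad 1 \le i \le N. \tag{9.2}$$

Since the quantity $\|\mathbf{v}_i\|$ is constant, its first order variation with respect to $\epsilon$ must be zero.
This constraint gives that $\mathbf{v}_{i,1} \perp \mathbf{v}_{i,0}$.

We assume that the $\lambda_{i,0}$ are all different, i.e. we consider non-degenerate perturbation
theory. Then, plugging (9.1) into

$$\mathbf{H}\mathbf{v}_i = \lambda_i\mathbf{v}_i \tag{9.3}$$

and matching the left and right hand side term by term in powers of $\epsilon$, one obtains

$$\lambda_i = \lambda_{i,0} + \epsilon\,(\mathbf{H}_1)_{ii} + \epsilon^2\sum_{\substack{j=1\\ j\ne i}}^N \frac{|(\mathbf{H}_1)_{ij}|^2}{\lambda_{i,0} - \lambda_{j,0}} + O(\epsilon^3), \tag{9.4}$$

where $(\mathbf{H}_1)_{ij} := \mathbf{v}_{i,0}^T\mathbf{H}_1\mathbf{v}_{j,0}$, and

$$\mathbf{v}_i = \mathbf{v}_{i,0} + \epsilon\sum_{\substack{j=1\\ j\ne i}}^N \frac{(\mathbf{H}_1)_{ij}}{\lambda_{i,0} - \lambda_{j,0}}\,\mathbf{v}_{j,0} + O(\epsilon^2). \tag{9.5}$$

Notice that the first order correction to $\mathbf{v}_i$ is indeed perpendicular to $\mathbf{v}_{i,0}$ as it does not have
a component in that direction.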
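The second-order expansion of the eigenvalues is easy to test numerically. In the sketch below (matrix size, $\epsilon$ and the random perturbation are illustrative choices), $\mathbf{H}_0$ is diagonal with a non-degenerate spectrum, so its eigenvectors are the canonical basis vectors and $(\mathbf{H}_1)_{ij}$ is simply the $(i,j)$ entry of $\mathbf{H}_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, eps = 5, 1e-4
lam0 = np.arange(N, dtype=float)            # non-degenerate spectrum 0, 1, ..., N-1
H0 = np.diag(lam0)                          # eigenvectors of H0: canonical basis
A = rng.uniform(-1.0, 1.0, size=(N, N))
H1 = 0.5 * (A + A.T)                        # real symmetric perturbation

exact = np.linalg.eigvalsh(H0 + eps * H1)   # ascending: same ordering as lam0 for small eps

first_order = lam0 + eps * np.diag(H1)
gaps = lam0[:, None] - lam0[None, :]        # lam_{i,0} - lam_{j,0}
np.fill_diagonal(gaps, np.inf)              # removes the j = i term from the sum
second_order = first_order + eps**2 * np.sum(H1**2 / gaps, axis=1)

err_first = np.max(np.abs(exact - first_order))
err_second = np.max(np.abs(exact - second_order))
print("error at first order :", err_first)     # of order eps^2
print("error at second order:", err_second)    # of order eps^3
```

Reducing `eps` by a factor of 10 should reduce the first-order error by roughly 100 and the second-order error by roughly 1000, until floating-point precision is reached.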


9.1.2 A Stochastic Differential Equation for Eigenvalues 


Next we use the above formulas to derive the so-called Dyson Brownian motion (DBM),
which gives the evolution of eigenvalues of a random matrix plus a Wigner ensemble whose
variance grows linearly with time. Let $\mathbf{M}_0$ be the initial matrix (random or not), and $\mathbf{X}_1$ be a
unit Wigner matrix that is independent of $\mathbf{M}_0$. Then we study, using (9.4), the eigenvalues of

$$\mathbf{M} = \mathbf{M}_0 + \sqrt{dt}\,\mathbf{X}_1, \tag{9.6}$$

where $dt$ is a small quantity which will be interpreted as a differential time step.

The derivation of the SDE for eigenvalues is much simpler if we use the rotational
invariance of the Wigner ensemble. The matrix $\mathbf{X}_1$ has the same law in any basis; we
therefore choose to express it in the diagonal basis of $\mathbf{M}_0$. In order to do so we must work
with the exact rotationally invariant Wigner ensemble where the diagonal variance is twice
the off-diagonal variance.

First, for the first order term (in terms of $\epsilon$), we have

$$(\mathbf{X}_1)_{ii} = \mathbf{v}_{i,0}^T\mathbf{X}_1\mathbf{v}_{i,0} \sim \mathcal{N}\left(0, \frac{2}{N}\right). \tag{9.7}$$

Note that the $(\mathbf{X}_1)_{ii}$ are independent for different $i$'s.
Then we study the second order term. We have

$$(\mathbf{X}_1)_{ji} := \mathbf{v}_{j,0}^T\mathbf{X}_1\mathbf{v}_{i,0} \sim \mathcal{N}\left(0, \frac{1}{N}\right). \tag{9.8}$$

So $|(\mathbf{X}_1)_{ji}|^2$ is a random variable with mean $1/N$ and with fluctuations around the mean
also of order $1/N$. As in Section 8.2.2, one can argue that these fluctuations are negligible
when integrated over time in the limit $dt \to 0$. In other words, $|(\mathbf{X}_1)_{ji}|^2$ can be treated as
deterministic.

Now using (9.4), we get that

$$d\lambda_i = \sqrt{\frac{2}{\beta N}}\,dB_i + \frac{1}{N}\sum_{\substack{j=1\\ j\ne i}}^N \frac{dt}{\lambda_i - \lambda_j}, \tag{9.9}$$

where $dB_i$ denotes a Brownian increment that comes from the $(\mathbf{X}_1)_{ii}$ term, and we have
added a factor $\beta$ for completeness (equal to 1 for real symmetric matrices). This is the
fundamental evolution equation for eigenvalues in a fictitious time that describes how much
of a Wigner matrix one progressively adds to the "initial" matrix $\mathbf{M}_0$ (see Fig. 9.1). The
astute reader will probably have recognized in the second term a Coulomb force deriving
from the logarithmic Coulomb potential $\log|\lambda_i - \lambda_j|$ encountered in Chapter 5.

Figure 9.1 A simulation of DBM for an $N = 25$ matrix starting from a Wigner matrix with $\sigma^2 = 1/4$
and evolving for one unit of time.

One can also derive a similar process for the eigenvectors that we give here for $\beta = 1$:

$$d\mathbf{v}_i = \frac{1}{\sqrt{N}}\sum_{\substack{j=1\\ j\ne i}}^N \frac{dB_{ij}}{\lambda_i - \lambda_j}\,\mathbf{v}_j - \frac{1}{2N}\sum_{\substack{j=1\\ j\ne i}}^N \frac{dt}{(\lambda_i - \lambda_j)^2}\,\mathbf{v}_i, \tag{9.10}$$

where $dB_{ij} = dB_{ji}$ ($i \ne j$) is a symmetric collection of Brownian motions, independent of
each other and of the $\{dB_i\}$ above.

The formulas (9.9) and (9.10) give the Dyson Brownian motion for the stochastic evolution
of the eigenvalues and eigenvectors of matrices of the form

$$\mathbf{M} = \mathbf{M}_0 + \mathbf{X}_t, \tag{9.11}$$

where $\mathbf{M}_0$ is some initial matrix and $\mathbf{X}_t$ is an independent Wigner ensemble with parameter
$\sigma^2 = t$. We will call the above matrix process a matrix Dyson Brownian motion.

In our study of large random matrices, we will be interested in the DBM when $N$ is large,
but actually the DBM is well defined for any $N$. As for the Itô lemma, we used the fact
that the Gaussian process can be divided into infinitesimal increments and that perturbation
theory becomes exact at that scale. We made no assumption about the size of $N$. We did
need a rotationally invariant Gaussian process so the diagonal variance must be twice the
off-diagonal one. In the most extreme example of $N = 1$, the eigenvalue of a $1 \times 1$ matrix
is just the value of its only element. Under DBM it simply undergoes a standard Brownian
motion with a variance of 2 per unit time.
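A matrix Dyson Brownian motion is straightforward to simulate by adding small independent Wigner increments and diagonalizing at each step, which avoids discretizing the singular drift in the eigenvalue equation. The sketch below is an illustration for $\beta = 1$ with hypothetical function names and arbitrary sizes. As a check it uses the fact that, counting the variances of the entries of $\mathbf{X}_t$ ($2t/N$ on the diagonal, $t/N$ off the diagonal), the average normalized second moment grows as $\mathbb{E}[\tau(\mathbf{M}_t^2)] = \tau(\mathbf{M}_0^2) + (N+1)\,t/N$.

```python
import numpy as np

def goe_increment(n, dt, rng):
    """Symmetric Gaussian increment: off-diagonal variance dt/n, diagonal variance 2 dt/n."""
    g = rng.normal(0.0, np.sqrt(0.5 * dt / n), size=(n, n))
    return g + g.T

def dbm_paths(m0, t_final, n_steps, rng):
    """Eigenvalue trajectories (n_steps + 1, N) of M_t = M_0 + X_t, each row sorted."""
    dt = t_final / n_steps
    m = np.array(m0, dtype=float)
    lam = [np.linalg.eigvalsh(m)]
    for _ in range(n_steps):
        m = m + goe_increment(m.shape[0], dt, rng)
        lam.append(np.linalg.eigvalsh(m))
    return np.array(lam)

rng = np.random.default_rng(4)
N, t_final = 8, 1.0
m0 = np.diag(np.linspace(-1.0, 1.0, N))        # arbitrary initial matrix
F0 = np.mean(np.diag(m0) ** 2)                 # tau(M_0^2)

paths = dbm_paths(m0, t_final, 10, rng)        # one realization, shape (11, 8)
F_final = np.mean([np.mean(dbm_paths(m0, t_final, 10, rng)[-1] ** 2)
                   for _ in range(400)])       # average over 400 realizations
print("simulated:", F_final, " predicted:", F0 + (N + 1) / N * t_final)
```

Plotting the columns of `paths` against time gives non-crossing trajectories of the kind described for Fig. 9.1; the correction $(N+1)/N$ rather than 1 reflects the doubled diagonal variance and is visible at small $N$.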


Exercise 9.1.1 Variance as a function of time under DBM 
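Consider the Dyson Brownian motion for a finite $N$ matrix:

$$d\lambda_i = \sqrt{\frac{2}{N}}\,dB_i + \frac{1}{N}\sum_{\substack{j=1\\ j\ne i}}^N \frac{dt}{\lambda_i - \lambda_j}, \tag{9.12}$$

and the function $F(\{\lambda_i\})$ that computes the second moment:

$$F(\{\lambda_i\}) = \frac{1}{N}\sum_{i=1}^N \lambda_i^2. \tag{9.13}$$

(a) Write down the stochastic process for $F(\{\lambda_i\})$ using the Itô vectorial formula
(8.17). In the case at hand $F$ does not depend explicitly on time and $\sigma_i^2 = 2/N$.
You will need to use the following identity:

$$2\sum_{\substack{i,j=1\\ j\ne i}}^N \frac{\lambda_i}{\lambda_i - \lambda_j} = \sum_{\substack{i,j=1\\ j\ne i}}^N \frac{\lambda_i - \lambda_j}{\lambda_i - \lambda_j} = N(N-1). \tag{9.14}$$

(b) Take the expectation value of your equation and show that $F(t) := \mathbb{E}[F(\{\lambda_i(t)\})]$ follows

$$F(t) = F(0) + \frac{N+1}{N}\,t. \tag{9.15}$$

Do not assume that $N$ is large.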


9.2 Dyson Brownian Motion II: Itô Calculus


Another way to derive the Dyson Brownian motion for the eigenvalues is to consider the
matrix Brownian motion (9.11) as a Brownian motion of all the elements of the matrix $\mathbf{X}$.
We have to treat the diagonal and off-diagonal elements separately because we want to use
the rotationally invariant Wigner matrix (GOE) with diagonal variance equal to twice the
off-diagonal variance. Also, only half of the off-diagonal elements are independent (the
matrix $\mathbf{X}$ is symmetric). We have

$$
dX_{kk} = \sqrt{\frac{2}{N}}\, dB_{kk} \quad \text{and} \quad dX_{k\ell} = \sqrt{\frac{1}{N}}\, dB_{k\ell} \quad \text{for } k < \ell, \tag{9.16}
$$

where $dB_{kk}$ and $dB_{k\ell}$ are $N$ and $N(N-1)/2$ independent unit Brownian motions.

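Each eigenvalue $\lambda_i$ is a (complicated) function of all the matrix elements of $\mathbf{X}$. We can
use the vectorial form of Itô’s lemma (8.17) to write a stochastic differential equation (SDE)
for $\lambda_i(\mathbf{X})$:

$$
d\lambda_i = \sum_{k=1}^{N} \frac{\partial \lambda_i}{\partial X_{kk}} \sqrt{\frac{2}{N}}\, dB_{kk}
 + \sum_{k=1}^{N} \sum_{\ell=k+1}^{N} \frac{\partial \lambda_i}{\partial X_{k\ell}} \sqrt{\frac{1}{N}}\, dB_{k\ell}
 + \sum_{k=1}^{N} \frac{\partial^2 \lambda_i}{\partial X_{kk}^2} \frac{dt}{N}
 + \sum_{k=1}^{N} \sum_{\ell=k+1}^{N} \frac{\partial^2 \lambda_i}{\partial X_{k\ell}^2} \frac{dt}{2N}. \tag{9.17}
$$

The key is to be able to compute the following partial derivatives:

$$
\frac{\partial \lambda_i}{\partial X_{kk}}; \quad \frac{\partial \lambda_i}{\partial X_{k\ell}}; \quad
\frac{\partial^2 \lambda_i}{\partial X_{kk}^2}; \quad \frac{\partial^2 \lambda_i}{\partial X_{k\ell}^2}, \tag{9.18}
$$

where $k < \ell$.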
Since $\mathbf{X}_t$ is rotationally invariant, we can rotate the basis such that $\mathbf{X}_0$ is diagonal:

$$
\mathbf{X}_0 = \operatorname{diag}(\lambda_1(0), \ldots, \lambda_N(0)). \tag{9.19}
$$

In order to compute the partial derivatives above, we can consider adding one small element
to the matrix $\mathbf{X}_0$, and compute the corresponding change in eigenvalues. We first perturb
diagonal elements and later we will deal with off-diagonal elements.

A shift of the $k$th diagonal entry of $\mathbf{X}_0$ by $\delta X_{kk}$ affects $\lambda_i$ with $i = k$ in a
linear fashion but leaves all other eigenvalues unaffected:

$$
\lambda_i \to \lambda_i + \delta X_{kk}\, \delta_{ik}. \tag{9.20}
$$

Thus we have

$$
\frac{\partial \lambda_i}{\partial X_{kk}} = \delta_{ik}; \qquad \frac{\partial^2 \lambda_i}{\partial X_{kk}^2} = 0. \tag{9.21}
$$


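Next we discuss how a perturbation in an off-diagonal entry of $\mathbf{X}_0$ can affect the
eigenvalues. A perturbation of the $(k\ell)$ entry by $\delta X_{k\ell} = \delta X_{\ell k}$ leads to the
following matrix:

$$
\mathbf{X}_0 + \delta\mathbf{X} =
\begin{pmatrix}
\lambda_1 & & & & & & \\
 & \ddots & & & & & \\
 & & \lambda_k & \cdots & \delta X_{k\ell} & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & \delta X_{k\ell} & \cdots & \lambda_\ell & & \\
 & & & & & \ddots & \\
 & & & & & & \lambda_N
\end{pmatrix}. \tag{9.22}
$$

Since this matrix is block diagonal (after a simple permutation), the eigenvalues in the $N - 2$
other $1 \times 1$ blocks are not affected by the perturbation, so trivially

$$
\frac{\partial \lambda_i}{\partial X_{k\ell}} = 0; \qquad \frac{\partial^2 \lambda_i}{\partial X_{k\ell}^2} = 0, \qquad \forall\, i \neq k, \ell. \tag{9.23}
$$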
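On the other hand, the eigenvalues of the block

$$
\begin{pmatrix} \lambda_k & \delta X_{k\ell} \\ \delta X_{k\ell} & \lambda_\ell \end{pmatrix} \tag{9.24}
$$

are modified, and exact diagonalization gives

$$
\lambda_\pm = \frac{\lambda_k + \lambda_\ell}{2} \pm \frac{\lambda_k - \lambda_\ell}{2}
 \sqrt{1 + \frac{4 (\delta X_{k\ell})^2}{(\lambda_k - \lambda_\ell)^2}}. \tag{9.25}
$$

We can expand this result to second order in $\delta X_{k\ell}$ to find

$$
\lambda_k \to \lambda_k + \frac{(\delta X_{k\ell})^2}{\lambda_k - \lambda_\ell} \quad \text{and} \quad
\lambda_\ell \to \lambda_\ell + \frac{(\delta X_{k\ell})^2}{\lambda_\ell - \lambda_k}. \tag{9.26}
$$

We thus readily see that the first partial derivative of $\lambda_i$ with respect to any off-diagonal
element is zero:

$$
\frac{\partial \lambda_i}{\partial X_{k\ell}} = 0 \quad \text{for } k < \ell. \tag{9.27}
$$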
For the second derivative, on the other hand, we obtain

$$
\frac{\partial^2 \lambda_i}{\partial X_{k\ell}^2} = \frac{2\delta_{ik}}{\lambda_i - \lambda_\ell} + \frac{2\delta_{i\ell}}{\lambda_i - \lambda_k}. \tag{9.28}
$$

Of the two terms on the right hand side, the first term exists only if $\ell > i$ while the second
term is only present when $k < i$. So, for a given $i$, only $N - 1$ terms of the form
$2/(\lambda_i - \lambda_j)$ are present (note that the problematic term $i = j$ is absent). Putting
everything back into Eq. (9.17), we find, with $\beta = 1$ here,

$$
d\lambda_i = \sqrt{\frac{2}{N}}\, dB_i + \frac{1}{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{dt}{\lambda_i - \lambda_j}, \tag{9.29}
$$

where $dB_i$ are independent Brownian motions (the old $dB_{kk}$ for $k = i$). We have thus
precisely recovered Equation (9.9) using Itô’s calculus.
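
The second-order expansion (9.26), on which the second derivative (9.28) rests, can be checked
against the exact eigenvalues of the $2 \times 2$ block (9.24). A minimal sketch, assuming NumPy;
the function names and the numerical values are illustrative.

```python
import numpy as np

def exact_block_eigenvalues(lam_k, lam_l, dx):
    """Exact eigenvalues (ascending) of [[lam_k, dx], [dx, lam_l]]."""
    block = np.array([[lam_k, dx], [dx, lam_l]])
    return np.linalg.eigvalsh(block)

def second_order_eigenvalues(lam_k, lam_l, dx):
    """Eigenvalues from the second-order expansion (9.26), sorted ascending."""
    shifted_k = lam_k + dx**2 / (lam_k - lam_l)
    shifted_l = lam_l + dx**2 / (lam_l - lam_k)
    return np.sort(np.array([shifted_k, shifted_l]))

def expansion_error(lam_k, lam_l, dx):
    exact = exact_block_eigenvalues(lam_k, lam_l, dx)
    approx = second_order_eigenvalues(lam_k, lam_l, dx)
    return float(np.max(np.abs(exact - approx)))

# Halving dx should reduce the error by about 2**4, since the neglected
# term is of fourth order in dx.
error_coarse = expansion_error(1.0, -0.5, 0.10)
error_fine = expansion_error(1.0, -0.5, 0.05)
```

The ratio `error_coarse / error_fine` is close to 16, which is consistent with the first
neglected term being of order $(\delta X_{k\ell})^4$.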


9.3 The Dyson Brownian Motion for the Resolvent 
9.3.1 A Burgers’ Equation for the Stieltjes Transform 


Consider a matrix $\mathbf{M}_t$ that undergoes a DBM starting from a matrix $\mathbf{M}_0$. At each
time $t$, $\mathbf{M}_t$ can be viewed as the sum of the matrix $\mathbf{M}_0$ and a Wigner matrix
$\mathbf{X}_t$ of variance $t$:

$$
\mathbf{M}_t = \mathbf{M}_0 + \mathbf{X}_t. \tag{9.30}
$$

In order to characterize the spectrum of the matrix $\mathbf{M}_t$, one should compute its Stieltjes
transform $\mathfrak{g}_t(z)$, which is the expectation value of the large $N$ limit of the function
$g_N(z)$, defined by

$$
g_N(z, \{\lambda_i\}) := \frac{1}{N} \sum_{i=1}^{N} \frac{1}{z - \lambda_i}. \tag{9.31}
$$

$g_N$ is thus a function of all eigenvalues $\{\lambda_i\}$ that undergo DBM while $z$ is just a constant
parameter. Since the eigenvalues evolve with time, $g_N$ is really a function of both $z$ and $t$.
We can use Itô’s lemma to write an SDE for $g_N(z, \{\lambda_i\})$. The ingredients are, as usual now,
the following partial derivatives:

$$
\frac{\partial g_N}{\partial \lambda_i} = \frac{1}{N} \frac{1}{(z - \lambda_i)^2}; \qquad
\frac{\partial^2 g_N}{\partial \lambda_i^2} = \frac{2}{N} \frac{1}{(z - \lambda_i)^3}. \tag{9.32}
$$
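We can now apply Itô’s lemma (8.17) using the dynamical equation for eigenvalues (9.9)
(with $\beta = 1$) to find

$$
dg_N = \frac{1}{N} \sqrt{\frac{2}{N}} \sum_{i=1}^{N} \frac{dB_i}{(z - \lambda_i)^2}
 + \frac{1}{N^2} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \frac{dt}{(z - \lambda_i)^2 (\lambda_i - \lambda_j)}
 + \frac{2}{N^2} \sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3}. \tag{9.33}
$$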


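We now massage the second term to arrive at a symmetric form in $i$ and $j$. In order to do
so, note that $i$ and $j$ are dummy indices that are summed over, so we can rename $i \to j$
and vice versa and get the same expression. Adding the two versions and dividing by 2, we
get that this term is equal to

$$
\begin{aligned}
\frac{1}{N^2} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \frac{dt}{(z - \lambda_i)^2 (\lambda_i - \lambda_j)}
&= \frac{1}{2N^2} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \left[ \frac{dt}{(z - \lambda_i)^2 (\lambda_i - \lambda_j)} + \frac{dt}{(z - \lambda_j)^2 (\lambda_j - \lambda_i)} \right] \\
&= \frac{1}{2N^2} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \frac{(2z - \lambda_i - \lambda_j)\, dt}{(z - \lambda_i)^2 (z - \lambda_j)^2}
 = \frac{1}{N^2} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \frac{dt}{(z - \lambda_i)(z - \lambda_j)^2} \\
&= \frac{1}{N^2} \sum_{i,j=1}^{N} \frac{dt}{(z - \lambda_i)(z - \lambda_j)^2} - \frac{1}{N^2} \sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3} \\
&= -g_N \frac{\partial g_N}{\partial z}\, dt - \frac{1}{N^2} \sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3}.
\end{aligned} \tag{9.34}
$$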


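Note that very similar manipulations have been used in Section 5.2.2. Thus we have

$$
\begin{aligned}
dg_N &= \frac{1}{N} \sqrt{\frac{2}{N}} \sum_{i=1}^{N} \frac{dB_i}{(z - \lambda_i)^2} - g_N \frac{\partial g_N}{\partial z}\, dt + \frac{1}{N^2} \sum_{i=1}^{N} \frac{dt}{(z - \lambda_i)^3} \\
&= \frac{1}{N} \sqrt{\frac{2}{N}} \sum_{i=1}^{N} \frac{dB_i}{(z - \lambda_i)^2} - g_N \frac{\partial g_N}{\partial z}\, dt + \frac{1}{2N} \frac{\partial^2 g_N}{\partial z^2}\, dt.
\end{aligned} \tag{9.35}
$$

Taking now the expectation of this equation (such that the $dB_i$ term vanishes), we get

$$
\mathbb{E}[dg_N(z)] = -\mathbb{E}\left[ g_N(z) \frac{\partial g_N(z)}{\partial z} \right] dt
 + \frac{1}{2N} \mathbb{E}\left[ \frac{\partial^2 g_N(z)}{\partial z^2} \right] dt. \tag{9.36}
$$

This equation is exact for any $N$. We can now take the $N \to \infty$ limit, where
$g_N(z) \to \mathfrak{g}_t(z)$. Using the fact that the Stieltjes transform is self-averaging, we get a
PDE for the time dependent Stieltjes transform $\mathfrak{g}_t(z)$:

$$
\frac{\partial \mathfrak{g}_t(z)}{\partial t} = -\mathfrak{g}_t(z) \frac{\partial \mathfrak{g}_t(z)}{\partial z}. \tag{9.37}
$$

Equation (9.37) is called the inviscid Burgers’ equation. Such an equation can in fact
develop singularities, so it often needs to be regularized by a viscosity term, which is in
fact present for finite $N$: it is precisely the last term of Eq. (9.36). Equation (9.37) can be
exactly solved using the method of characteristics. This will be the topic of Section 10.1
in the next chapter.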


9.3.2 The Evolution of the Resolvent 


Let us now consider the full matrix resolvent of $\mathbf{M}_t$, defined as
$\mathbf{G}_t(z) = (z\mathbf{1} - \mathbf{M}_t)^{-1}$. Clearly, the quantity $\mathfrak{g}_t(z)$ studied above is simply
the normalized trace of $\mathbf{G}_t(z)$, but $\mathbf{G}_t(z)$ also contains information on the evolution of
eigenvectors. Since each element of $\mathbf{G}_t$ depends on all the elements of $\mathbf{M}_t$, one can
again use Itô’s calculus to derive an evolution equation for $\mathbf{G}_t(z)$. The calculation is more
involved because one needs to carefully keep track of all indices. In this technical section, we
sketch the derivation of Dyson Brownian motion for the resolvent and briefly discuss the result,
which will be used further in Section 10.1.

Since $\mathbf{M}_t = \mathbf{M}_0 + \mathbf{X}_t$, the Itô lemma gives

$$
dG_{ij} = \sum_{k,\ell=1}^{N} \frac{\partial G_{ij}}{\partial M_{k\ell}}\, dX_{k\ell}
 + \frac{1}{2} \sum_{k,\ell,m,n=1}^{N} \frac{\partial^2 G_{ij}}{\partial M_{k\ell}\, \partial M_{mn}}\, d[X_{k\ell} X_{mn}], \tag{9.38}
$$


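where the last term denotes the covariation of $X_{k\ell}$ and $X_{mn}$, and we have considered
$M_{k\ell}$ and $M_{\ell k}$ to be independent variables following 100% correlated Brownian motions.
Next, we compute the derivatives

$$
\frac{\partial G_{ij}}{\partial M_{k\ell}} = \frac{1}{2} \left[ G_{ik} G_{j\ell} + G_{jk} G_{i\ell} \right], \tag{9.39}
$$

from which we deduce the second derivatives

$$
\frac{\partial^2 G_{ij}}{\partial M_{k\ell}\, \partial M_{mn}} = \frac{1}{4} \left[ (G_{im} G_{kn} + G_{in} G_{km})\, G_{j\ell} + \cdots \right], \tag{9.40}
$$

where we have not written the other $GGG$ products obtained by applying Eq. (9.39) twice.
Now, using the properties of the Brownian noise, the quadratic covariation reads

$$
d[X_{k\ell} X_{mn}] = \frac{dt}{N} \left( \delta_{km} \delta_{\ell n} + \delta_{kn} \delta_{\ell m} \right), \tag{9.41}
$$

so that we get from (9.38), taking into account symmetries:

$$
dG_{ij}(z,t) = \sum_{k,\ell=1}^{N} G_{ik} G_{j\ell}\, dX_{k\ell}
 + \frac{1}{N} \sum_{k,\ell=1}^{N} \left( G_{ik} G_{\ell k} G_{\ell j} + G_{ik} G_{kj} G_{\ell\ell} \right) dt. \tag{9.42}
$$

If we now take the average over the Brownian motions $dX_{k\ell}$, we find the following
evolution for the average resolvent:

$$
\partial_t \mathbb{E}[\mathbf{G}_t(z)] = \frac{1}{N} \operatorname{Tr}\big( \mathbf{G}_t(z) \big)\, \mathbb{E}[\mathbf{G}_t^2(z)] + \frac{1}{N} \mathbb{E}[\mathbf{G}_t^3(z)]. \tag{9.43}
$$

Now, one can notice that

$$
\mathbf{G}_t^2(z) = -\partial_z \mathbf{G}_t(z); \qquad \mathbf{G}_t^3(z) = \frac{1}{2} \partial_{zz}^2 \mathbf{G}_t(z), \tag{9.44}
$$

which hold even before averaging. By sending $N \to \infty$, we then obtain the following
matrix PDE for the resolvent:

$$
\partial_t \mathbb{E}[\mathbf{G}_t(z)] = -\mathfrak{g}_t(z)\, \partial_z \mathbb{E}[\mathbf{G}_t(z)], \quad \text{with} \quad \mathbb{E}[\mathbf{G}_0(z)] = \mathbf{G}_{\mathbf{M}_0}(z). \tag{9.45}
$$

Note that this equation is linear in $\mathbf{G}_t(z)$ once the Stieltjes transform $\mathfrak{g}_t(z)$ is known.
Taking the trace of Eq. (9.45) immediately leads back to the Burgers’ equation (9.37) for
$\mathfrak{g}_t(z)$ itself, as expected.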
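
The identities (9.44) hold for any symmetric matrix and any $z$ away from its spectrum, so they
are easy to check with finite differences. A minimal sketch, assuming NumPy; the test matrix,
the value of $z$ and the step `eps` are arbitrary illustrative choices.

```python
import numpy as np

def resolvent(m, z):
    """Matrix resolvent G(z) = (z 1 - M)^{-1}."""
    return np.linalg.inv(z * np.eye(m.shape[0]) - m)

rng = np.random.default_rng(0)
n = 6
h = rng.standard_normal((n, n))
m = (h + h.T) / np.sqrt(2 * n)      # small symmetric test matrix
z, eps = 3.0 + 0.5j, 1e-4

g = resolvent(m, z)
# Central finite differences for the first and second z-derivatives.
dg_dz = (resolvent(m, z + eps) - resolvent(m, z - eps)) / (2 * eps)
d2g_dz2 = (resolvent(m, z + eps) - 2 * g + resolvent(m, z - eps)) / eps**2

err_square = np.max(np.abs(g @ g + dg_dz))            # G^2 = -dG/dz
err_cube = np.max(np.abs(g @ g @ g - 0.5 * d2g_dz2))  # G^3 = (1/2) d^2G/dz^2
# The normalized trace of G reproduces g_N(z) of Eq. (9.31).
trace_gap = abs(np.trace(g) / n - np.mean(1.0 / (z - np.linalg.eigvalsh(m))))
```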


9.4 The Dyson Brownian Motion with a Potential 
9.4.1 A Modified Langevin Equation for Eigenvalues 


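In this section, we modify Dyson Brownian motion by adding a potential such that the
stationary state of these interacting random walks coincides with the eigenvalue measure
of $\beta$-ensembles, namely

$$
P(\{\lambda_i\}) = Z_N^{-1} \exp\left\{ -\frac{\beta}{2} \left[ \sum_{i=1}^{N} N V(\lambda_i) - \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \log|\lambda_i - \lambda_j| \right] \right\}. \tag{9.46}
$$

The general vectorial Langevin equation (8.36) leading to such an equilibrium with
$\sigma^2 = 2/N$ immediately gives us the following DBM in a potential $V(\lambda)$:

$$
d\lambda_k = \sqrt{\frac{2}{N}}\, dB_k + \left( \frac{1}{N} \sum_{\substack{\ell=1 \\ \ell \neq k}}^{N} \frac{\beta}{\lambda_k - \lambda_\ell} - \frac{\beta}{2} V'(\lambda_k) \right) dt, \tag{9.47}
$$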

which recovers Eq. (9.9) in the absence of a potential. See Figure 9.2 for an illustration.

Dyson Brownian motion in a potential has many applications. Numerically it can be used
to generate matrices for an arbitrary potential and an arbitrary value of $\beta$, a task not obvious
a priori from the definition (9.46). Figure 9.2 shows a simulation of the matrix potential
studied in Section 5.3.3. Note that DBM generates the correct density of eigenvalues; it also
generates the proper statistics for the joint distribution of eigenvalues.
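
As an illustration of the numerical use of DBM, here is a minimal sketch restricted to
$\beta = 1$, assuming NumPy. Instead of discretizing the eigenvalue equation (9.47) directly
(its drift is singular when two eigenvalues come close), it evolves the matrix itself: at each
step the eigenvalues are shifted by $-\tfrac{1}{2} V'(\lambda)\, dt$ in the current eigenbasis and a
Wigner increment of variance $dt$ is added, so that the eigenvalues follow (9.47) with
$\beta = 1$ up to discretization errors. The function names, the step size and the initial
condition are illustrative choices, not those used for Figure 9.2.

```python
import numpy as np

def wigner_increment(n, dt, rng):
    """Symmetric Gaussian increment: off-diagonal variance dt/n, diagonal 2*dt/n."""
    h = rng.standard_normal((n, n))
    return np.sqrt(dt / (2.0 * n)) * (h + h.T)

def dbm_in_potential(m0, v_prime, total_time, dt, rng):
    """Euler scheme for dM = -V'(M)/2 dt + dX (beta = 1).
    Returns the sorted eigenvalues recorded at the start of every step."""
    n = m0.shape[0]
    m = m0.copy()
    history = []
    for _ in range(int(round(total_time / dt))):
        lam, u = np.linalg.eigh(m)
        history.append(lam)
        drifted = lam - 0.5 * v_prime(lam) * dt      # potential drift on eigenvalues
        m = (u * drifted) @ u.T + wigner_increment(n, dt, rng)
    return np.array(history)

rng = np.random.default_rng(0)
history = dbm_in_potential(np.zeros((20, 20)), lambda x: x, total_time=40.0, dt=0.02, rng=rng)
steady_second_moment = float(np.mean(history[250:] ** 2))   # discard times t < 5
```

With the quadratic potential $V'(x) = x$ used here, the time-averaged second moment should
fluctuate around $(N+1)/N$; other potentials are obtained by changing `v_prime`.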

It is interesting to see how Burgers’ equation for the Stieltjes transform, Eq. (9.37), is
changed in the case where $V(\lambda) = \lambda^2/2$, i.e. in the standard GOE case $\beta = 1$. Redoing
the steps leading to Eq. (9.36) with the extra $V'$ term in the right hand side of Eq. (9.47)
modifies the Burgers’ equation into

$$
\frac{\partial \mathfrak{g}_t(z)}{\partial t} = -\mathfrak{g}_t(z) \frac{\partial \mathfrak{g}_t(z)}{\partial z} + \frac{1}{2} \frac{\partial \big( z\, \mathfrak{g}_t(z) \big)}{\partial z}. \tag{9.48}
$$

The solution to this equation will be discussed in the next chapter.

Figure 9.2 (left) A simulation of DBM with a potential for an $N = 25$ matrix starting from a Wigner
matrix with $\sigma^2 = 1/10$ and evolving within the potential $V(x) = \frac{x^2}{2} + \frac{x^4}{4}$ for 10 units
of time. Note that the steady state is quickly reached (within one or two units of time). (right)
Histogram of the eigenvalues for the same process for $N = 400$ and 200 discrete steps per unit time.
The histogram is over all matrices from time 3 to 10 (560 000 points). The agreement with the
theoretical density (Eq. (5.58)) is very good.

More theoretically, DBM can be used in proofs of local universality, which is one of the
most important results in random matrix theory. Local universality is the concept that many
properties of the joint law of eigenvalues do not depend on the specifics of the random
matrix in question, provided one looks at them on a scale $\xi$ comparable to the average
distance between eigenvalues, i.e. on scales $N^{-1} \lesssim \xi \ll 1$. Many such properties arise
from the logarithmic eigenvalue repulsion and indeed depend only on the symmetry class
($\beta$) of the model.

Another useful property of DBM is its speed of convergence to the steady state. With time
normalized as in Eq. (9.47), global properties (such as the density of eigenvalues) converge
in a time of order 1, as we discuss in the next subsection. Local properties on the other hand
(e.g. eigenvalue spacing) converge much faster, in a time of order $1/N$, i.e. as soon as the
eigenvalues have “collided” a few times with one another.¹


1 The time needed for two Brownian motions a distance $d$ apart to meet for the first time is of order $d^2/\sigma^2$. In our
case $d = 1/N$ (the typical distance between two eigenvalues) and $\sigma^2 = 2/N$. The typical collision time is therefore $(2N)^{-1}$.
Note however that eigenvalues actually never cross under Eq. (9.47), but the corresponding eigenvectors strongly mix when
such quasi-collisions occur.

Exercise 9.4.1 Moments under DBM with a potential
Consider the moments of the eigenvalues as a function of time:

$$
F_k(t) = \frac{1}{N} \sum_{i=1}^{N} \lambda_i^k(t), \tag{9.49}
$$

for eigenvalues undergoing DBM under a potential $V(x)$, Eq. (9.47), in the
orthogonal case $\beta = 1$. In this exercise you will need to show (and to use)
the following identity:

$$
2 \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \frac{\lambda_i^k}{\lambda_i - \lambda_j}
 = \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \frac{\lambda_i^k - \lambda_j^k}{\lambda_i - \lambda_j}
 = \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \sum_{\ell=0}^{k-1} \lambda_i^{\ell} \lambda_j^{k-1-\ell}. \tag{9.50}
$$
(a) Using Itô calculus, write an SDE for $F_2(t)$.
(b) By taking the expectation value of your equation, show that

$$
\frac{d}{dt} \mathbb{E}[F_2(t)] = 1 + \frac{1}{N} - \mathbb{E}\left[ \frac{1}{N} \sum_{i=1}^{N} \lambda_i(t) V'(\lambda_i(t)) \right]. \tag{9.51}
$$

(c) In the Wigner case, $V'(x) = x$, find the steady-state value of $\mathbb{E}[F_2(t)]$ for any
finite $N$.
(d) For a random matrix $\mathbf{X}$ drawn from a generic potential $V(x)$, show that in the
large $N$ limit, we have

$$
\tau[V'(\mathbf{X})\mathbf{X}] = 1, \tag{9.52}
$$

where $\tau$ is the expectation value of the normalized trace defined by (2.1).

(e) Show that this equation is consistent with $\tau(\mathbf{W}) = 1$ for a Wishart matrix
whose potential is given by Eq. (5.4).

(f) In the large $N$ limit, find a general expression for $\tau[V'(\mathbf{X})\mathbf{X}^k]$ by writing the
steady-state equation for $\mathbb{E}[F_{k+1}(t)]$; you can neglect the Itô term. The first
two should be given by

$$
\tau[V'(\mathbf{X})\mathbf{X}^2] = 2\tau[\mathbf{X}] \quad \text{and} \quad \tau[V'(\mathbf{X})\mathbf{X}^3] = 2\tau[\mathbf{X}^2] + \tau[\mathbf{X}]^2. \tag{9.53}
$$

(g) In the unit Wigner case $V'(x) = x$, show that your relation in (f) is equivalent
to the Catalan number inductive relation (3.27), with $\tau(\mathbf{X}^{2m}) = C_m$ and
$\tau(\mathbf{X}^{2m+1}) = 0$.


9.4.2 The Fokker-Planck Equation for DBM 


From the Langevin equation, Eq. (9.47), one can derive the Fokker-Planck equation
describing the time evolution of the joint distribution of eigenvalues, $P(\{\lambda_i\}, t)$. It reads

$$
\frac{\partial P}{\partial t} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \lambda_i} \left[ \frac{\partial P}{\partial \lambda_i} - F_i P \right], \tag{9.54}
$$

where we use $P$ as an abbreviation for $P(\{\lambda_i\}, t)$, and, for a quadratic confining potential
$V(\lambda) = \lambda^2/2$, a generalized force given by

$$
F_i := \beta \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{1}{\lambda_i - \lambda_j} - \frac{N \beta \lambda_i}{2}. \tag{9.55}
$$


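The trick now is to introduce an auxiliary function $W(\{\lambda_i\}, t)$, defined as

$$
P(\{\lambda_i\}, t) := \exp\left[ \frac{\beta}{4} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \log|\lambda_i - \lambda_j| - \frac{\beta N}{8} \sum_{i=1}^{N} \lambda_i^2 \right] W(\{\lambda_i\}, t). \tag{9.56}
$$

Then after a little work, one finds the following evolution equation for $W$:²

$$
\frac{\partial W}{\partial t} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\partial^2 W}{\partial \lambda_i^2} - V_i W \right], \tag{9.58}
$$

with

$$
V_i(\{\lambda_j\}) := \frac{\beta^2 N^2}{16} \lambda_i^2 - \frac{\beta(2 - \beta)}{4} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{1}{(\lambda_i - \lambda_j)^2} - \frac{N\beta}{4} \left( 1 + \frac{\beta(N-1)}{2} \right). \tag{9.59}
$$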
jži 
Looking for $W_\Gamma$ such that

$$
\frac{\partial W_\Gamma}{\partial t} = -\Gamma W_\Gamma, \tag{9.60}
$$

one finds that $W_\Gamma$ is the solution of the following eigenvalue problem:

$$
\sum_{i=1}^{N} \left[ -\frac{1}{N} \frac{\partial^2 W_\Gamma}{\partial \lambda_i^2} + \frac{1}{N} V_i W_\Gamma \right] = \Gamma W_\Gamma. \tag{9.61}
$$

One notices that Eq. (9.61) is a (real) Schrödinger equation for $N$ interacting “particles” in
a quadratic potential, with an interacting potential that depends on the inverse of the square
distance between particles. This is called the Calogero model, which happens to be exactly
soluble in one dimension, both classically and quantum mechanically. In particular, the
whole spectrum of this Hamiltonian is known, and given by³

$$
\Gamma(n_1, n_2, \ldots, n_N) = \frac{\beta}{2} \left( \sum_{i=1}^{N} n_i - \frac{N(N-1)}{2} \right); \qquad 0 \leq n_1 < n_2 < \cdots < n_N. \tag{9.62}
$$


2 One has to use, along the line, the following identity:

$$
\sum_{i \neq j \neq k} \frac{1}{\lambda_i - \lambda_j}\, \frac{1}{\lambda_i - \lambda_k} = 0. \tag{9.57}
$$

3 Because $W_\Gamma(\{\lambda_i\})$ must vanish as two $\lambda$’s coincide, one must choose the so-called fermionic branch of the spectrum.




The smallest eigenvalue corresponds to $n_1 = 0, n_2 = 1, \ldots, n_N = N - 1$ and is such that
$\Gamma = 0$. This corresponds to the equilibrium state of the Fokker-Planck equation:

$$
W_0(\{\lambda_i\}) = \exp\left[ \frac{\beta}{4} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \log|\lambda_i - \lambda_j| - \frac{\beta N}{8} \sum_{i=1}^{N} \lambda_i^2 \right]. \tag{9.63}
$$

All other “excited states” have positive $\Gamma$’s, corresponding to exponentially decaying
modes (in time) of the Fokker-Planck equation. The smallest, non-zero value of $\Gamma$ is such
that $n_N = N$, all other values of $n_i$ being unchanged. Hence $\Gamma_1$ is equal to $\beta/2$.

In conclusion, we have explicitly shown that the equilibration time of the DBM in a
quadratic potential is equal to $2/\beta$. As announced at the end of the previous section, the
density of eigenvalues indeed converges in a time of order unity.

The case $\beta = 2$ is particularly simple, since the interaction term in $V_i$ disappears
completely. We will use this result to show that, in the absence of a quadratic potential, the
time dependent joint distribution of eigenvalues, $P(\{\lambda_i\}, t)$, can be expressed, for $\beta = 2$,
as a simple determinant: this is the so-called Karlin-McGregor representation, see next
section.


9.5 Non-Intersecting Brownian Motions and the Karlin-McGregor Formula 


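Let $p(y,t|x)$ be the probability density that a Brownian motion starting at $x$ at $t = 0$ is
at $y$ at time $t$:

$$
p(y,t|x) = \frac{1}{\sqrt{2\pi t}} \exp\left( -\frac{(x-y)^2}{2t} \right), \tag{9.64}
$$

where we set $\sigma^2 = 1$. Note that $p(y,t|x)$ obeys the diffusion equation

$$
\frac{\partial p(y,t|x)}{\partial t} = \frac{1}{2} \frac{\partial^2 p(y,t|x)}{\partial y^2}. \tag{9.65}
$$

We now consider $N$ independent Brownian motions starting at $\mathbf{x} = (x_1, x_2, \ldots, x_N)$
at $t = 0$, with $x_1 > x_2 > \cdots > x_N$. The Karlin-McGregor formula states that the
probability $P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})$ that these Brownian motions have reached positions
$\mathbf{y} = (y_1 > y_2 > \cdots > y_N)$ at time $t$ without ever intersecting between 0 and $t$ is given
by the following determinant:

$$
P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x}) =
\begin{vmatrix}
p(y_1,t|x_1) & p(y_1,t|x_2) & \cdots & p(y_1,t|x_N) \\
p(y_2,t|x_1) & p(y_2,t|x_2) & \cdots & p(y_2,t|x_N) \\
\vdots & \vdots & \ddots & \vdots \\
p(y_N,t|x_1) & p(y_N,t|x_2) & \cdots & p(y_N,t|x_N)
\end{vmatrix}. \tag{9.66}
$$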


One can easily prove this by noting that the determinant involves sums of products of $N$
terms $p(y_i,t|x_j)$, each product $\Pi$ involving one and only one $y_i$. Each product $\Pi$ therefore
obeys the multidimensional diffusion equation:

$$
\frac{\partial \Pi}{\partial t} = \frac{1}{2} \sum_{i=1}^{N} \frac{\partial^2 \Pi}{\partial y_i^2}. \tag{9.67}
$$

Being the sum of such products, $P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})$ also obeys the same diffusion equation, as it
should since we consider $N$ independent Brownian motions:

$$
\frac{\partial P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})}{\partial t} = \frac{1}{2} \sum_{i=1}^{N} \frac{\partial^2 P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})}{\partial y_i^2}. \tag{9.68}
$$

The determinant structure also ensures that $P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x}) = 0$ as soon as any two $y$’s are
equal. Finally, because the $x$’s and the $y$’s are ordered, $P_{\mathrm{KM}}(\mathbf{y}, t = 0|\mathbf{x})$ obeys the correct
initial condition.

Note that the total survival probability $P(t; \mathbf{x}) := \int d\mathbf{y}\, P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})$ decreases with
time, since as soon as two Brownian motions meet, the corresponding trajectory is killed. In fact,
one can show that $P(t; \mathbf{x}) \sim t^{-N(N-1)/4}$ at large times.
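
The determinant (9.66) and the decay of the survival probability can be illustrated for
$N = 2$. The sketch below assumes NumPy; the function names and the integration grid are
illustrative. For two walkers the survival probability is also known in closed form from the
reflection principle applied to the difference of the two motions (a standard result, quoted
here and not derived in the text): $P(t;\mathbf{x}) = \operatorname{erf}\big((x_1 - x_2)/(2\sqrt{t})\big)$,
which indeed decays as $t^{-1/2} = t^{-N(N-1)/4}$ at large times.

```python
import numpy as np
from math import erf

def heat_kernel(y, t, x):
    """Transition density p(y, t | x) of Eq. (9.64), with sigma^2 = 1."""
    return np.exp(-(y - x) ** 2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

def karlin_mcgregor(y, t, x):
    """Determinant (9.66) for ordered start points x and end points y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    kernel = heat_kernel(y[:, None], t, x[None, :])   # entry (i, j) is p(y_i, t | x_j)
    return float(np.linalg.det(kernel))

def survival_two_walkers(x, t, half_width=10.0, step=0.02):
    """Riemann sum of P_KM over the ordered region y1 > y2 for N = 2."""
    grid = np.arange(-half_width, half_width, step)
    y1, y2 = np.meshgrid(grid, grid, indexing="ij")
    density = (heat_kernel(y1, t, x[0]) * heat_kernel(y2, t, x[1])
               - heat_kernel(y1, t, x[1]) * heat_kernel(y2, t, x[0]))
    return float(np.sum(density[y1 > y2]) * step ** 2)

x_start = [0.5, -0.5]
numerical = survival_two_walkers(x_start, t=1.0)
closed_form = erf((x_start[0] - x_start[1]) / (2.0 * np.sqrt(1.0)))
```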

Now, the probability $P(\mathbf{y},t|\mathbf{x})$ that these $N$ independent Brownian motions end at $\mathbf{y}$ at
time $t$ conditional on the fact that the paths never ever intersect, i.e. for any time between
$t = 0$ and $t = \infty$, turns out to be given by a very similar formula:

$$
P(\mathbf{y},t|\mathbf{x}) = \frac{\Delta(\mathbf{y})}{\Delta(\mathbf{x})}\, P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x}), \tag{9.69}
$$

where $\Delta(\mathbf{x})$ is the Vandermonde determinant $\prod_{i<j} (x_j - x_i)$ (and similarly for $\Delta(\mathbf{y})$).
What we want to show is that $P(\mathbf{y},t|\mathbf{x})$ is the solution of the Fokker-Planck equation
for the Dyson Brownian motion, Eq. (9.54), with $\beta = 2$ and in the absence of any
confining potential. Indeed, as shown above, $P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})$ obeys the diffusion equation for $N$
independent Brownian motions with the annihilation boundary condition $P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x}) = 0$
when $y_i = y_j$ for any given pair $i, j$.
Now compare with the definition Eq. (9.56) of $W$ for the Dyson Brownian motions
with $\beta = 2$ and without any confining potential:

$$
P(\{\lambda_i\}, t) := \exp\left[ \frac{1}{2} \sum_{\substack{i,j=1 \\ j \neq i}}^{N} \log|\lambda_i - \lambda_j| \right] W(\{\lambda_i\}, t) = \Delta(\{\lambda_i\})\, W(\{\lambda_i\}, t). \tag{9.70}
$$

From Eq. (9.58) we see that in the present case $W(\{\lambda_i\}, t)$ also obeys the diffusion equation
for $N$ independent Brownian motions. Since $P(\{\lambda_i\}, t) \sim |\lambda_i - \lambda_j|^2$ when two eigenvalues
are close, we also see that $W(\{\lambda_i\}, t)$ vanishes linearly whenever two eigenvalues meet,
and therefore obeys the same boundary conditions as $P_{\mathrm{KM}}(\mathbf{y},t|\mathbf{x})$ with $y_i = \lambda_i$.

The conclusion is therefore that the Dyson Brownian motion without external forces
is, for $\beta = 2$, equivalent to $N$ Brownian motions constrained to never ever intersect.


10
Addition of Large Random Matrices


In this chapter we seek to understand how the eigenvalue density of the sum of two large 
random matrices A and B can be obtained from their individual densities. In the case where, 
say, A is a Wigner matrix X, the Dyson Brownian motion formalism of the previous chapter 
allows us to swiftly answer that question. We will see that a particular transform of the 
density of B, called the R-transform, appears naturally. We then show that the R-transform 
appears in the more general context where the eigenbases of A and B are related by a 
random rotation matrix O. In this case, one can construct a Fourier transform for matrices, 
which allows us to define the analog of the generating function for random variables. As 
in the case of IID random variables, the logarithm of this matrix generating function is
additive when one adds two randomly rotated, large matrices. The derivative of this object 
turns out to be the R-transform, leading to the central result of the present chapter (and 
of the more abstract theory of free variables, see Chapter 11): the R-transform of the 
sum of two randomly rotated, large matrices is equal to the sum of R-transforms of each 
individual matrix. 


10.1 Adding a Large Wigner Matrix to an Arbitrary Matrix 


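Let $\mathbf{M}_t = \mathbf{M}_0 + \mathbf{X}_t$ be the sum of a large matrix $\mathbf{M}_0$ and a large Wigner matrix
$\mathbf{X}_t$, such that the variance of each element grows as $t$. This defines a Dyson Brownian motion
as described in the previous chapter, see Eq. (9.11). We have shown in Section 9.3.1 that in
this case the Stieltjes transform $\mathfrak{g}_t(z)$ of $\mathbf{M}_t$ satisfies the Burgers’ equation:

$$
\frac{\partial \mathfrak{g}_t(z)}{\partial t} = -\mathfrak{g}_t(z) \frac{\partial \mathfrak{g}_t(z)}{\partial z}, \tag{10.1}
$$

with initial condition $\mathfrak{g}_0(z) := \mathfrak{g}_{\mathbf{M}_0}(z)$. We now proceed to show that the solution of
this Burgers’ equation can be simply expressed using an $\mathbf{M}_0$ dependent function: its
R-transform.

Using the so-called method of characteristics, one can show that

$$
\mathfrak{g}_t(z) = \mathfrak{g}_0(z - t\, \mathfrak{g}_t(z)). \tag{10.2}
$$

If the method of characteristics is unknown to the reader, one can verify that (10.2) indeed
satisfies Eq. (10.1) for any function $\mathfrak{g}_0(z)$. Indeed, let us compute $\partial_t \mathfrak{g}_t(z)$ and
$\partial_z \mathfrak{g}_t(z)$ using Eq. (10.2):

$$
\partial_t \mathfrak{g}_t(z) = \mathfrak{g}_0'(z - t \mathfrak{g}_t(z)) \left[ -\mathfrak{g}_t(z) - t\, \partial_t \mathfrak{g}_t(z) \right]
\;\Rightarrow\; \partial_t \mathfrak{g}_t(z) = -\frac{\mathfrak{g}_t(z)\, \mathfrak{g}_0'(z - t \mathfrak{g}_t(z))}{1 + t\, \mathfrak{g}_0'(z - t \mathfrak{g}_t(z))}, \tag{10.3}
$$

and

$$
\partial_z \mathfrak{g}_t(z) = \mathfrak{g}_0'(z - t \mathfrak{g}_t(z)) \left[ 1 - t\, \partial_z \mathfrak{g}_t(z) \right]
\;\Rightarrow\; \partial_z \mathfrak{g}_t(z) = \frac{\mathfrak{g}_0'(z - t \mathfrak{g}_t(z))}{1 + t\, \mathfrak{g}_0'(z - t \mathfrak{g}_t(z))}, \tag{10.4}
$$

such that Eq. (10.1) is indeed satisfied.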


Example: Suppose $\mathbf{M}_0 = 0$. Then we have $\mathfrak{g}_0(z) = z^{-1}$. Plugging into (10.2), we obtain
that

$$
\mathfrak{g}_t(z) = \frac{1}{z - t\, \mathfrak{g}_t(z)}, \tag{10.5}
$$

which is the self-consistent Eq. (2.35) in the Wigner case with $\sigma^2 = t$. Indeed, if we start
with the zero matrix, then $\mathbf{M}_t = \mathbf{X}_t$ is just a Wigner matrix with parameter $\sigma^2 = t$.
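
This example can be checked numerically. The sketch below assumes NumPy and uses the explicit
root of the quadratic equation $t g^2 - z g + 1 = 0$ implied by (10.5) that decays as $1/z$,
namely $g = (z - \sqrt{z^2 - 4t})/(2t)$ for real $z > 2\sqrt{t}$; this closed form is an
assumption of the sketch rather than something derived in this section. It then tests the
fixed-point relation (10.5) and the Burgers' equation (10.1) by finite differences.

```python
import numpy as np

def g_wigner(z, t):
    """Root of t g^2 - z g + 1 = 0 behaving as 1/z, for real z > 2 sqrt(t)."""
    return (z - np.sqrt(z * z - 4.0 * t)) / (2.0 * t)

def burgers_residual(z, t, eps=1e-5):
    """Finite-difference value of dg/dt + g dg/dz, which should vanish."""
    dg_dt = (g_wigner(z, t + eps) - g_wigner(z, t - eps)) / (2.0 * eps)
    dg_dz = (g_wigner(z + eps, t) - g_wigner(z - eps, t)) / (2.0 * eps)
    return dg_dt + g_wigner(z, t) * dg_dz

z, t = 3.0, 1.0
# Eq. (10.5): g_t(z) = 1 / (z - t g_t(z)).
fixed_point_gap = g_wigner(z, t) - 1.0 / (z - t * g_wigner(z, t))
residual = burgers_residual(z, t)
```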

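Back to the general case, we denote as $\mathfrak{z}_t(g)$ the inverse function¹ of $\mathfrak{g}_t(z)$. Now fix
$g = \mathfrak{g}_t(z) = \mathfrak{g}_0(z - tg)$ and $z = \mathfrak{z}_t(g)$; we apply the function $\mathfrak{z}_0$ to $g$ and get

$$
\mathfrak{z}_0(g) = z - tg = \mathfrak{z}_t(g) - tg \;\Rightarrow\; \mathfrak{z}_t(g) = \mathfrak{z}_0(g) + tg. \tag{10.6}
$$

The inverse of the Stieltjes transform of $\mathbf{M}_t$ is given by the inverse of that of $\mathbf{M}_0$ plus a
simple shift $tg$. If we know $\mathfrak{g}_0(z)$ we can compute its inverse $\mathfrak{z}_0(g)$ and thus easily obtain
$\mathfrak{z}_t(g)$, which after inversion hopefully recovers $\mathfrak{g}_t(z)$.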


Example: Suppose $\mathbf{M}_0$ is a Wigner matrix with variance $\sigma^2$. We first want to compute
the inverse of $\mathfrak{g}_0(z)$; to do so we use the fact that $\mathfrak{g}_0(z)$ satisfies Eq. (2.35), and we get that

$$
\mathfrak{z}_0(g) = \sigma^2 g + \frac{1}{g}. \tag{10.7}
$$

Then, by (10.6), we get that

$$
\mathfrak{z}_t(g) = \mathfrak{z}_0(g) + tg = (\sigma^2 + t)\, g + \frac{1}{g}, \tag{10.8}
$$

which is the inverse Stieltjes transform for Wigner matrices with variance $\sigma^2 + t$. In other
words $\mathfrak{g}_t(z)$ satisfies the Wigner equation (2.35) with $\sigma^2$ replaced by $\sigma^2 + t$. This result is
not surprising: each element of the sum of two Wigner matrices is just the sum of Gaussian
random variables. So $\mathbf{M}_t$ is itself a Wigner matrix with the sum of the variances as its
variance.


1 We will discuss in Section 10.4 the invertibility of the function $\mathfrak{g}(z)$.

We can now tackle the more general case when the initial matrix is not necessarily
Wigner. Call $\mathbf{B} = \mathbf{M}_t$ and $\mathbf{A} = \mathbf{M}_0$. Then by (10.6), we get

$$
\mathfrak{z}_{\mathbf{B}}(g) = \mathfrak{z}_{\mathbf{A}}(g) + tg = \mathfrak{z}_{\mathbf{A}}(g) + \mathfrak{z}_{\mathbf{X}_t}(g) - \frac{1}{g}. \tag{10.9}
$$

We now define the R-transform as

$$
R(g) := \mathfrak{z}(g) - \frac{1}{g}. \tag{10.10}
$$

Note that the R-transform of a Wigner matrix of variance $t$ is simply given by

$$
R_{\mathbf{X}_t}(g) = tg. \tag{10.11}
$$

This definition allows us to rewrite Eq. (10.9) above as a nice additive relation between
R-transforms:

$$
R_{\mathbf{B}}(g) = R_{\mathbf{A}}(g) + R_{\mathbf{X}_t}(g). \tag{10.12}
$$

In the next section we will generalize this law of addition to (large) matrices $\mathbf{X}$ that are
not necessarily Wigner. The R-transform will prove to be a very powerful tool to study
large random matrices. Some of its properties are left to be derived in Exercises 10.1.1 and
10.1.2 and will be further discussed in Chapter 15. We finish this section by computing
the R-transform of a white Wishart matrix. Remember that its Stieltjes transform satisfies
Eq. (4.37), i.e.

$$
q z \mathfrak{g}^2 - (z - 1 + q)\, \mathfrak{g} + 1 = 0, \tag{10.13}
$$

which can be written in terms of the inverse function $\mathfrak{z}(g)$:

$$
\mathfrak{z}(g) = \frac{1}{1 - qg} + \frac{1}{g}, \tag{10.14}
$$

from which we can read off the R-transform:

$$
R_{\mathbf{W}}(g) = \frac{1}{1 - qg}. \tag{10.15}
$$
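
The passage from (10.13) to (10.14) can be verified numerically. A minimal sketch, assuming
NumPy; the function names and the test values of $z$ and $q$ are illustrative. It solves the
quadratic (10.13) for the root that behaves as $1/z$ at large real $z$ and checks that
(10.14) maps it back to $z$.

```python
import numpy as np

def wishart_stieltjes(z, q):
    """Root of q z g^2 - (z - 1 + q) g + 1 = 0 that behaves as 1/z,
    for real z to the right of the spectrum."""
    b = z - 1.0 + q
    return (b - np.sqrt(b * b - 4.0 * q * z)) / (2.0 * q * z)

def wishart_inverse(g, q):
    """Inverse function of Eq. (10.14): R_W(g) + 1/g."""
    return 1.0 / (1.0 - q * g) + 1.0 / g

z, q = 5.0, 0.5
g = wishart_stieltjes(z, q)
round_trip_gap = abs(wishart_inverse(g, q) - z)
quadratic_residual = abs(q * z * g**2 - (z - 1.0 + q) * g + 1.0)
```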
Exercise 10.1.1 Taylor series for the R-transform
Let $\mathfrak{g}(z)$ be the Stieltjes transform of a random matrix $\mathbf{M}$:

$$
\mathfrak{g}(z) = \tau\left( (z\mathbf{1} - \mathbf{M})^{-1} \right) = \int_{\operatorname{supp}\{\rho\}} \frac{\rho(\lambda)\, d\lambda}{z - \lambda}. \tag{10.16}
$$

We saw that the power series of $\mathfrak{g}(z)$ around $z = \infty$ is given by the moments of
$\mathbf{M}$ ($m_n := \tau(\mathbf{M}^n)$):

$$
\mathfrak{g}(z) = \sum_{n=0}^{\infty} \frac{m_n}{z^{n+1}} \quad \text{with } m_0 \equiv 1. \tag{10.17}
$$

Call $\mathfrak{z}(g)$ the functional inverse of $\mathfrak{g}(z)$, which is well defined in a neighborhood
of $g = 0$, and define $R(g)$ as

$$
R(g) = \mathfrak{z}(g) - 1/g. \tag{10.18}
$$

(a) By writing the power series of $R(g)$ near zero, show that $R(g)$ is regular at
zero and that $R(0) = m_1$. Therefore the power series of $R(g)$ starts at $g^0$:

$$
R(g) = \sum_{n=1}^{\infty} \kappa_n g^{n-1}. \tag{10.19}
$$

(b) Now assume $m_1 = \kappa_1 = 0$ and compute $\kappa_2$, $\kappa_3$ and $\kappa_4$ as a function of $m_2$, $m_3$
and $m_4$ in that case.

Exercise 10.1.2 Scaling of the R-transform
Using your answer from Exercise 2.3.1: If $\mathbf{A}$ is a random matrix drawn from
a well-behaved ensemble with Stieltjes transform $\mathfrak{g}_{\mathbf{A}}(z)$ and R-transform $R_{\mathbf{A}}(g)$,
what is the R-transform of the random matrices $\alpha\mathbf{A}$ and $\mathbf{A} + b\mathbf{1}$ where $\alpha$ and $b$
are non-zero real numbers?
Exercise 10.1.3 Sum of symmetric orthogonal and Wigner matrices
Consider as in Exercise 1.2.4 a random symmetric orthogonal matrix $\mathbf{M}$ and a
Wigner matrix $\mathbf{X}$ of variance $\sigma^2$. We are interested in the spectrum of their sum
$\mathbf{E} = \mathbf{M} + \mathbf{X}$.

(a) Given that the eigenvalues of $\mathbf{M}$ are $\pm 1$ and that in the large $N$ limit each
eigenvalue appears with weight $\frac{1}{2}$, write the limiting Stieltjes transform $\mathfrak{g}_{\mathbf{M}}(z)$.

(b) $\mathbf{E}$ can be thought of as undergoing Dyson Brownian motion starting at $\mathbf{E}(0) = \mathbf{M}$
and reaching the desired $\mathbf{E}$ at $t = \sigma^2$. Use Eq. (10.2) to write an equation
for $\mathfrak{g}_{\mathbf{E}}(z)$. This will be a cubic equation in $\mathfrak{g}$.

(c) You can obtain the same equation using the inverse function $\mathfrak{z}_{\mathbf{M}}(g)$ of your
answer in (a). Show that

$$
\mathfrak{z}_{\mathbf{M}}(g) = \frac{1 + \sqrt{1 + 4g^2}}{2g}, \tag{10.20}
$$

where one had to pick the root that makes $\mathfrak{z}(g) \sim 1/g$ near $g = 0$.

(d) Using Eq. (10.6), write $\mathfrak{z}_{\mathbf{E}}(g)$ and invert this relation to obtain an equation for
$\mathfrak{g}_{\mathbf{E}}(z)$. You should recover the same equation as in (b).

(e) Eigenvalues of $\mathbf{E}$ will be located where your equation admits non-real
solutions for real $z$. First look at $z = 0$; the equation becomes quadratic after
factorizing a trivial root. Find a criterion for $\sigma^2$ such that the equation admits
non-real solutions. Compare with your answer in Exercise 1.2.4 (b).

(f) At $\sigma^2 = 1$, the equation is still cubic but is somewhat simpler. A real cubic
equation of the form $ax^3 + bx^2 + cx + d = 0$ will have non-real solutions
iff $\Delta < 0$ where $\Delta = 18abcd - 4b^3d + b^2c^2 - 4ac^3 - 27a^2d^2$. Using this
criterion show that for $\sigma^2 = 1$ the edges of the eigenvalue spectrum are given
by $\lambda = \pm 3\sqrt{3}/2 \approx \pm 2.60$.

(g) Again at $\sigma^2 = 1$, the solution near $\mathfrak{g}(0) = 0$ can be expanded in fractional
powers of $z$. Show that we have

$$
\mathfrak{g}(z) = z^{1/3} + O(z), \quad \text{which implies} \quad \rho(x) = \frac{\sqrt{3}}{2\pi} |x|^{1/3}, \tag{10.21}
$$

for $x$ near zero.

(h) For $\sigma^2 = 1/2$, 1 and 2, solve numerically the cubic equation for $\mathfrak{g}_{\mathbf{E}}(z)$ for
$z = x$ real and plot the density of eigenvalues $\rho(x) = |\operatorname{Im}(\mathfrak{g}_{\mathbf{E}}(x))|/\pi$ for one
of the complex roots if present.


10.2 Generalization to Non-Wigner Matrices 
10.2.1 Set-Up 


In the previous section, we derived a formula for the Stieltjes transform of the sum of a 
Wigner matrix and an arbitrary matrix. We would like to find a generalization of this result 
to a larger class of matrices. 

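Take two $N \times N$ matrices: $\mathbf{A}$, with eigenvalues $\{\lambda_i\}_{1 \leq i \leq N}$ and eigenvectors
$\{\mathbf{v}_i\}_{1 \leq i \leq N}$, and $\mathbf{B}$, with eigenvalues $\{\mu_i\}_{1 \leq i \leq N}$ and eigenvectors
$\{\mathbf{u}_i\}_{1 \leq i \leq N}$. Then the eigenvalues of $\mathbf{C} = \mathbf{B} + \mathbf{A}$ will in general depend in a
complicated way on the overlaps between the eigenvectors of $\mathbf{B}$ and the eigenvectors of $\mathbf{A}$.
In the trivial case where $\mathbf{v}_i = \mathbf{u}_i$ for all $i$, we have that the eigenvalues of $\mathbf{B} + \mathbf{A}$ are
simply given by $\nu_i = \lambda_i + \mu_i$. However, this is neither generic nor very interesting.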

One important property of Wigner matrices is that their eigenvectors are Haar distributed,
that is, the matrix of eigenvectors is distributed uniformly in the group $O(N)$ and
each eigenvector is uniformly distributed on the unit sphere $S^{N-1}$. Thus, when $N$ is large,
it is very unlikely that any one of them will have a significant overlap with the eigenvectors
of $\mathbf{B}$. This is the property that we want to keep in our generalization. We will study what
happens for general matrices $\mathbf{B}$ and $\mathbf{A}$ when their eigenvectors are random with respect
to one another. We will define this relative randomness notion (called “freeness”) more
precisely in the next chapter. Here, to ensure the randomness of the eigenvectors, we will
apply a random rotation to the matrix $\mathbf{A}$ and define the free addition as

$$
\mathbf{C} = \mathbf{B} + \mathbf{O}\mathbf{A}\mathbf{O}^T, \tag{10.22}
$$

where $\mathbf{O}$ is a Haar distributed random orthogonal matrix. Then it is easy to see that
$\mathbf{O}\mathbf{A}\mathbf{O}^T$ is rotationally invariant since $\mathbf{O}'\mathbf{O}$ is also Haar distributed for any fixed
$\mathbf{O}' \in O(N)$.


10.2.2 Matrix Fourier Transform


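We saw in Section 8.1 that the function $H_X(t) = \log \mathbb{E} \exp(itX)$ is additive when one adds
independent scalar variables. When $X$ is a matrix, it is plausible that $t$ should also be a
matrix $\mathbf{T}$, but in the end we need to take the exponential of a scalar, so a possible candidate
would be

$$
I(\mathbf{X}, \mathbf{T}) := \left\langle \exp\left( \frac{N}{2} \operatorname{Tr} \mathbf{T}\mathbf{O}\mathbf{X}\mathbf{O}^T \right) \right\rangle_{\mathbf{O}}. \tag{10.23}
$$

The notation $\langle \cdot \rangle_{\mathbf{O}}$ means that we average over all orthogonal matrices $\mathbf{O}$ (with a flat
weight) normalized such that $\langle 1 \rangle_{\mathbf{O}} = 1$. This defines the Haar measure on the group of
orthogonal matrices. Equation (10.23) defines the so-called Harish-Chandra-Itzykson-Zuber (HCIZ)
integral.² Note that by definition,
$I(\mathbf{O}_1 \mathbf{X} \mathbf{O}_1^T, \mathbf{T}) = I(\mathbf{X}, \mathbf{O}_1 \mathbf{T} \mathbf{O}_1^T) = I(\mathbf{X}, \mathbf{T})$ for an
arbitrary rotation matrix $\mathbf{O}_1$. This means that $I(\mathbf{X}, \mathbf{T})$ only depends on the eigenvalues of
$\mathbf{X}$ and $\mathbf{T}$.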
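Now consider $\mathbf{C} = \mathbf{B} + \mathbf{O}_1 \mathbf{A} \mathbf{O}_1^T$ with a random $\mathbf{O}_1$. For large matrix sizes, the
eigenvalue spectrum of $\mathbf{C}$ will turn out not to depend on the specific choice of $\mathbf{O}_1$, provided
it is chosen according to the Haar measure. Therefore, one can average $I(\mathbf{C}, \mathbf{T})$ over
$\mathbf{O}_1$ and obtain

$$
I(\mathbf{C}, \mathbf{T}) = \left\langle \exp\left( \frac{N}{2} \operatorname{Tr} \mathbf{T}\mathbf{O}(\mathbf{B} + \mathbf{O}_1 \mathbf{A} \mathbf{O}_1^T)\mathbf{O}^T \right) \right\rangle_{\mathbf{O}, \mathbf{O}_1} = I(\mathbf{B}, \mathbf{T})\, I(\mathbf{A}, \mathbf{T}), \tag{10.25}
$$

where we have used that $\mathbf{O}\mathbf{O}_1 = \mathbf{O}'$ is a random rotation independent from $\mathbf{O}$ itself. Hence
we conclude that $\log I$ is additive in this case, as is the logarithm of the characteristic
function in the scalar case.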
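
The statement that the spectrum of $\mathbf{C}$ does not depend on the particular Haar rotation can be
explored numerically. The sketch below assumes NumPy; the function names are illustrative.
A Haar orthogonal matrix is drawn by QR decomposition of a Gaussian matrix with the usual sign
correction, and $\mathbf{A} = \mathbf{B}$ is taken diagonal with eigenvalues $\pm 1$ in equal proportions.
For this pair the additivity of R-transforms gives $\tau(\mathbf{C}^4) = 6$ in the large $N$ limit
(quoted here without derivation), whereas the sum of two independent $\pm 1$ scalars would have
a fourth moment of 8.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Multiply each column by the sign of the matching diagonal entry of r;
    # this removes the sign ambiguity of the QR decomposition.
    return q * np.sign(np.diag(r))

def free_sum(b, a, rng):
    """C = B + O A O^T with O Haar distributed, as in Eq. (10.22)."""
    o = haar_orthogonal(a.shape[0], rng)
    return b + o @ a @ o.T

def normalized_moment(matrix, k):
    """tau(matrix^k), computed from the eigenvalues."""
    return float(np.mean(np.linalg.eigvalsh(matrix) ** k))

rng = np.random.default_rng(0)
n = 400
signs = np.diag(np.repeat([1.0, -1.0], n // 2))
# Two independent rotations give nearly the same fourth moment.
fourth_moments = [normalized_moment(free_sum(signs, signs, rng), 4) for _ in range(2)]
```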

For a general matrix $\mathbf{T}$, the HCIZ integral is quite complicated, as will be further
discussed in Section 10.5. Fortunately, for our purpose we can choose the “Fourier” matrix
$\mathbf{T}$ to be rank-1 and in this case the integral can be computed. A symmetric rank-1 matrix can
be written as

$$
\mathbf{T} = t\, \mathbf{v}\mathbf{v}^T, \tag{10.26}
$$

where $t$ is the eigenvalue and $\mathbf{v}$ is a unit vector. We will show that the large $N$ behavior of
$I(\mathbf{T}, \mathbf{B})$ is given, in this case, by

$$
I(\mathbf{T}, \mathbf{B}) \approx \exp\left( \frac{N}{2} H_{\mathbf{B}}(t) \right), \tag{10.27}
$$

for some function $H_{\mathbf{B}}(t)$ that depends on the particular matrix $\mathbf{B}$.


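2 The HCIZ integral can be defined with an integral over orthogonal, unitary or symplectic matrices. In the general case it is defined as


$$I_\beta(\mathbf{X},\mathbf{T}) = \left\langle \exp\left(\frac{N\beta}{2}\operatorname{Tr}\mathbf{X}\mathbf{O}\mathbf{T}\mathbf{O}^\dagger\right)\right\rangle_{\mathbf{O}}, \tag{10.24}$$


with $\beta$ equal to 1, 2 or 4 and $\mathbf{O}$ averaged over the corresponding group. The unitary $\beta = 2$ case is the most often studied,
for which some explicit results are available.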




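More formally we define

$$H_{\mathbf{B}}(t) = \lim_{N\to\infty} \frac{2}{N} \log \left\langle \exp\left(\frac{tN}{2}\operatorname{Tr}\mathbf{v}\mathbf{v}^T\mathbf{O}\mathbf{B}\mathbf{O}^T\right)\right\rangle_{\mathbf{O}}. \tag{10.28}$$

If $\mathbf{C} = \mathbf{B} + \mathbf{A}$ where $\mathbf{A}$ is randomly rotated with respect to $\mathbf{B}$, the precise statement is that


$$H_{\mathbf{C}}(t) = H_{\mathbf{B}}(t) + H_{\mathbf{A}}(t), \tag{10.29}$$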


i.e. the function H is additive. We now need to relate this function to the R-transform 
encountered in the previous section. 


10.3 The Rank-1 HCIZ Integral


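To get a useful theory, we need to have a concrete expression for this function $H_{\mathbf{B}}$. Without
loss of generality, we can assume $\mathbf{B}$ is diagonal (in fact, we can diagonalize $\mathbf{B}$ and absorb
the eigenmatrix into the orthogonal matrix $\mathbf{O}$ we integrate over). Moreover, for simplicity
we assume that $t > 0$. Then $\mathbf{O}^T\mathbf{T}\mathbf{O}$ can be regarded as proportional to a random projector:


$$\mathbf{O}^T\mathbf{T}\mathbf{O} = \boldsymbol{\psi}\boldsymbol{\psi}^T, \tag{10.30}$$


with $\|\boldsymbol{\psi}\|^2 = t$ and $\boldsymbol{\psi}/\|\boldsymbol{\psi}\|$ uniformly distributed on the unit sphere. Then we make a
change of variable $\boldsymbol{\psi} \to \boldsymbol{\psi}/\sqrt{N}$, and calculate


$$Z_t(\mathbf{B}) = \int \frac{\mathrm{d}^N\boldsymbol{\psi}}{(2\pi)^{N/2}}\, \delta\left(\|\boldsymbol{\psi}\|^2 - Nt\right) \exp\left(\frac{1}{2}\boldsymbol{\psi}^T\mathbf{B}\boldsymbol{\psi}\right), \tag{10.31}$$


where we have added a factor of $(2\pi)^{-N/2}$ for later convenience. Because $Z_t(\mathbf{B})$ is not
properly normalized (i.e. $Z_t(0) \neq 1$), we will need to normalize it to compute $I(\mathbf{T},\mathbf{B})$:


$$\left\langle \exp\left(\frac{N}{2}\operatorname{Tr}\mathbf{T}\mathbf{O}\mathbf{B}\mathbf{O}^T\right)\right\rangle_{\mathbf{O}} = \frac{Z_t(\mathbf{B})}{Z_t(0)}. \tag{10.32}$$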


10.3.1 A Saddle Point Calculation 


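We can now express the Dirac delta as an integral over the imaginary axis:


$$\delta(x) = \int_{-\infty}^{\infty} \frac{e^{-izx}}{2\pi}\,\mathrm{d}z = \int_{-i\infty}^{i\infty} \frac{e^{-zx/2}}{4i\pi}\,\mathrm{d}z.$$


Now let $\Lambda$ be a parameter larger than the maximum eigenvalue of $\mathbf{B}$: $\Lambda > \lambda_{\max}(\mathbf{B})$. We
introduce the factor

$$1 = \exp\left(-\frac{\Lambda\left(\|\boldsymbol{\psi}\|^2 - Nt\right)}{2}\right),$$

since $\|\boldsymbol{\psi}\|^2 = Nt$. Then, absorbing $\Lambda$ into $z$, we get that

$$Z_t(\mathbf{B}) = \int_{\Lambda - i\infty}^{\Lambda + i\infty} \frac{\mathrm{d}z}{4i\pi} \int \frac{\mathrm{d}^N\boldsymbol{\psi}}{(2\pi)^{N/2}} \exp\left(-\frac{1}{2}\boldsymbol{\psi}^T(z - \mathbf{B})\boldsymbol{\psi} + \frac{Nzt}{2}\right). \tag{10.33}$$


We can now perform the Gaussian integral over the vector $\boldsymbol{\psi}$:


$$\begin{aligned}
Z_t(\mathbf{B}) &= \int_{\Lambda - i\infty}^{\Lambda + i\infty} \frac{\mathrm{d}z}{4i\pi}\, \det(z - \mathbf{B})^{-1/2} \exp\left(\frac{Nzt}{2}\right) \\
&= \int_{\Lambda - i\infty}^{\Lambda + i\infty} \frac{\mathrm{d}z}{4i\pi} \exp\left\{\frac{N}{2}\left[zt - \frac{1}{N}\sum_k \log\left(z - \lambda_k(\mathbf{B})\right)\right]\right\},
\end{aligned} \tag{10.34}$$


where $\lambda_k(\mathbf{B})$, $1 \le k \le N$, are the eigenvalues of $\mathbf{B}$. Then we denote


$$F_t(z,\mathbf{B}) := zt - \frac{1}{N}\sum_k \log\left(z - \lambda_k(\mathbf{B})\right). \tag{10.35}$$


The integral in (10.34) is oscillatory, and by the stationary phase approximation (see
Appendix A.1), it is dominated by the point where


$$\partial_z F_t(z,\mathbf{B}) = 0 \;\Rightarrow\; t - \frac{1}{N}\sum_k \frac{1}{z - \lambda_k(\mathbf{B})} = t - g_N^{\mathbf{B}}(z) = 0. \tag{10.36}$$


If $g_N^{\mathbf{B}}(z)$ can be inverted then we can express $z$ as $\mathfrak{z}(t)$. For $x > \lambda_{\max}$, $g_N^{\mathbf{B}}(x)$ is monotonically
decreasing and thus invertible. So for $t < g_N^{\mathbf{B}}(\lambda_{\max})$, a unique $\mathfrak{z}(t)$ exists and
$\mathfrak{z}(t) > \lambda_{\max}$ (see Section 10.4). Since $F_t(z,\mathbf{B})$ is analytic to the right of $z = \lambda_{\max}$, we
can deform the contour to reach this point (see Fig. 10.1). Using the saddle point formula
(Eq. (A.3)), we have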


Figure 10.1 Graphical representation of the integral Eq. (10.34) in the complex plane (horizontal axis Re z,
vertical axis Im z). The crosses represent the eigenvalues of $\mathbf{B}$ and are singular points of the integrand.
The integration is from $\Lambda - i\infty$ to $\Lambda + i\infty$ where $\Lambda > \lambda_{\max}$. The saddle point is at $z = \mathfrak{z}(t) > \lambda_{\max}$.
Since the integrand is analytic to the right of $\lambda_{\max}$, the integration path can be deformed to go through $\mathfrak{z}(t)$.
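$$\begin{aligned}
Z_t(\mathbf{B}) &\sim \frac{\sqrt{4\pi}}{4\pi\left|N\,\partial_z^2 F_t(\mathfrak{z}(t),\mathbf{B})\right|^{1/2}} \exp\left[\frac{N}{2}\left(\mathfrak{z}(t)t - \frac{1}{N}\sum_k \log\left(\mathfrak{z}(t) - \lambda_k(\mathbf{B})\right)\right)\right] \\
&= \frac{1}{2\sqrt{N\pi\left|g_{\mathbf{B}}'(\mathfrak{z}(t))\right|}} \exp\left[\frac{N}{2}\left(\mathfrak{z}(t)t - \frac{1}{N}\sum_k \log\left(\mathfrak{z}(t) - \lambda_k(\mathbf{B})\right)\right)\right].
\end{aligned} \tag{10.37}$$

For the case $\mathbf{B} = 0$, we have $g_{\mathbf{B}}(z) = z^{-1} \Rightarrow \mathfrak{z}(t) = t^{-1}$, so we get


$$Z_t(0) \sim \frac{1}{2t\sqrt{N\pi}} \exp\left[\frac{N}{2}\left(1 + \log t\right)\right]. \tag{10.38}$$


In the large $N$ limit, the prefactor in front of the exponential does not contribute to $H_{\mathbf{B}}(t)$
and we finally get


$$\lim_{N\to\infty} \frac{2}{N} \log \left\langle \exp\left(\frac{N}{2}\operatorname{Tr}\mathbf{T}\mathbf{O}\mathbf{B}\mathbf{O}^T\right)\right\rangle_{\mathbf{O}} = \mathfrak{z}(t)t - 1 - \log t - \frac{1}{N}\sum_k \log\left(\mathfrak{z}(t) - \lambda_k(\mathbf{B})\right). \tag{10.39}$$

By the definition (10.28), we then get that


$$H_{\mathbf{B}}(t) = H(\mathfrak{z}(t),t), \qquad H(z,t) := zt - 1 - \log t - \frac{1}{N}\sum_k \log\left(z - \lambda_k(\mathbf{B})\right). \tag{10.40}$$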


10.3.2 Recovering R-Transforms 


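We found an expression for $H_{\mathbf{B}}(t)$ but in a form that is not easy to work with. But note that
$H(z,t)$ comes from a saddle point approximation and therefore its partial derivative with
respect to $z$ is zero: $\partial_z H(\mathfrak{z}(t),t) = 0$. This allows us to compute a much simpler expression
for the derivative of $H_{\mathbf{B}}(t)$:


$$\frac{\mathrm{d}H_{\mathbf{B}}(t)}{\mathrm{d}t} = \frac{\partial H}{\partial z}(\mathfrak{z}(t),t)\,\frac{\mathrm{d}\mathfrak{z}(t)}{\mathrm{d}t} + \frac{\partial H}{\partial t}(\mathfrak{z}(t),t) = \mathfrak{z}(t) - \frac{1}{t} = R_{\mathbf{B}}(t), \tag{10.41}$$


where $R_{\mathbf{B}}(t)$ denotes the R-transform defined in (10.10) (we have used the very definition of
$\mathfrak{z}(t)$ from the previous section). Moreover, from its definition, we trivially have $H_{\mathbf{B}}(0) = 0$.
Hence we can write


$$H_{\mathbf{B}}(t) := \int_0^t R_{\mathbf{B}}(x)\,\mathrm{d}x. \tag{10.42}$$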
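The relation between Eqs. (10.40) and (10.41) can be checked at finite N. The sketch below is an illustration under stated assumptions: the function names (`g_N`, `z_of_t`, `H`, `R`), the bisection solver and the test spectrum are choices made for this example, not part of the text. It solves g_N(z) = t to the right of the largest eigenvalue, evaluates H(z(t), t), and compares a finite-difference derivative of H with z(t) - 1/t.

```python
import numpy as np

def g_N(z, lam):
    """Discrete Stieltjes transform g_N(z) = (1/N) sum_k 1/(z - lam_k)."""
    return np.mean(1.0 / (z - lam))

def z_of_t(t, lam, iters=200):
    """Solve g_N(z) = t for z > lam_max by bisection (g_N decreases there)."""
    lo = lam.max() + 1e-12            # g_N(lo) is very large
    hi = lam.max() + 1.0 / t + 1.0    # g_N(hi) <= 1/(1/t + 1) < t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g_N(mid, lam) > t:
            lo = mid                  # value too large: the root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def H(t, lam):
    """H(z(t), t) of Eq. (10.40), evaluated at the saddle point z(t)."""
    z = z_of_t(t, lam)
    return z * t - 1.0 - np.log(t) - np.mean(np.log(z - lam))

def R(t, lam):
    """Finite-N R-transform z(t) - 1/t, the right-hand side of Eq. (10.41)."""
    return z_of_t(t, lam) - 1.0 / t

lam = np.linspace(-1.0, 1.0, 50)      # hypothetical eigenvalues of B
t, h = 0.4, 1e-5
finite_diff = (H(t + h, lam) - H(t - h, lam)) / (2 * h)
print(finite_diff, R(t, lam))         # the two numbers should agree closely
```

The agreement holds at any finite N because the z-derivative of H vanishes at the saddle point, which is exactly the argument used above.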


We already know that H is additive. Thus its derivative, i.e. the R-transform, is also 
additive: 


$$R_{\mathbf{C}}(t) = R_{\mathbf{B}}(t) + R_{\mathbf{A}}(t), \tag{10.43}$$


as is the case when A is a Wigner matrix. This property is therefore valid as soon as A is 
“free” with respect to B, i.e. when the basis that diagonalizes A is a random rotation of the 
basis that diagonalizes B. 
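A quick numerical experiment illustrates the additivity of the R-transform. The example below is a sketch under assumptions chosen for convenience: both matrices have half their eigenvalues at +1 and half at -1, and `haar_orthogonal` is a helper written for this illustration. For such spectra the free sum converges to the arcsine law on [-2, 2], whose second and fourth moments are 2 and 6, whereas the sum of two classically independent ±1 variables would have a fourth moment of 8.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Haar orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(1)
n = 400
signs = np.repeat([1.0, -1.0], n // 2)   # spectrum shared by A and B
b = np.diag(signs)
o = haar_orthogonal(n, rng)
c = b + o @ np.diag(signs) @ o.T         # C = B + O A O^T

eigs = np.linalg.eigvalsh(c)
m2 = np.mean(eigs**2)                    # free prediction: 2
m4 = np.mean(eigs**4)                    # free prediction: 6 (classical: 8)
print(m2, m4)
```

The fourth moment is the first place where free and classical independence disagree, which makes it a convenient diagnostic.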
The discussion leading to Eq. (10.42) can be extended to the HCIZ integral (Eq. (10.23)),
when the rank of the matrix $\mathbf{T}$ is very small compared to $N$. In this case we get³


$$I(\mathbf{T},\mathbf{B}) \approx \exp\left(\frac{N}{2}\sum_{i=1}^{n} H_{\mathbf{B}}(t_i)\right) = \exp\left(\frac{N}{2}\operatorname{Tr} H_{\mathbf{B}}(\mathbf{T})\right), \tag{10.45}$$


where $t_i$ are the $n$ non-zero eigenvalues of $\mathbf{T}$ and with the same $H_{\mathbf{B}}(t)$ as above. When $\mathbf{T}$
has rank 1 we recover that $\operatorname{Tr} H_{\mathbf{B}}(\mathbf{T}) = H_{\mathbf{B}}(t)$, where $t$ is the sole non-zero eigenvalue
of $\mathbf{T}$.

The above formalism is based on the assumption that $g(z)$ is invertible, which is generally
only true when $t = g(z)$ is small enough. This corresponds to the case where $z$ is
sufficiently large. Recall that the expansion of $g(z)$ at large $z$ has coefficients given by the
moments of the random matrix by (2.22). On the other hand, the expansion of $H(t)$ around
$t = 0$ will give coefficients called the free cumulants of the random matrix, which are
important objects in the study of free probability, as we will show in the next chapter.


10.4 Invertibility of the Stieltjes Transform 


The question of the invertibility of the Stieltjes transform arises often enough that it is
worth spending some time discussing it. In Section 10.1, we used the inverse of the limiting
Stieltjes transform $g(z)$ to solve Burgers’ equation, which led to the introduction
of the R-transform $R(g) = \mathfrak{z}(g) - 1/g$. In Section 10.3.1 we invoked the invertibility
of the discrete Stieltjes transform $g_N(z)$ to compute the rank-1 HCIZ integral.


10.4.1 Discrete Stieltjes Transform 


Recall the discrete Stieltjes transform of a matrix $\mathbf{A}$ with $N$ eigenvalues $\{\lambda_k\}$:


$$g_N^{\mathbf{A}}(z) = \frac{1}{N}\sum_{k=1}^{N} \frac{1}{z - \lambda_k}. \tag{10.46}$$


This function is well defined for any $z$ on the real axis except on the finite set $\{\lambda_k\}$. For
$z > \lambda_{\max}$, each of the terms in the sum is positive and monotonically decreasing with $z$ so
$g_N^{\mathbf{A}}(z)$ is a positive monotonically decreasing function of $z$. As $z \to \infty$, $g_N^{\mathbf{A}}(z) \to 0$. By
the same argument, for $z < \lambda_{\min}$, $g_N^{\mathbf{A}}(z)$ is a negative monotonically decreasing function
of $z$ tending to zero as $z$ goes to minus infinity. Actually, the normalization of $g_N^{\mathbf{A}}(z)$ is such
that we have


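3 The same computation can be done for any value of $\beta$, yielding

$$I_\beta(\mathbf{T},\mathbf{B}) \approx \exp\left(\frac{N\beta}{2}\sum_{i=1}^{n} H_{\mathbf{B}}(t_i)\right) = \exp\left(\frac{N\beta}{2}\operatorname{Tr} H_{\mathbf{B}}(\mathbf{T})\right), \tag{10.44}$$


where $I_\beta(\mathbf{T},\mathbf{B})$ is defined in the footnote on page 141 and $\mathbf{T}$ has low rank.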
Figure 10.2 (left) A particular $g_N^{\mathbf{A}}(z)$ for $\mathbf{A}$ a Wigner matrix of size $N = 5$ shown for real values of
$z$. The gray areas left of $\lambda_{\min}$ and right of $\lambda_{\max}$ show the values of $z$ for which it is invertible. (right)
The inverse function $\mathfrak{z}(g)$. Note that $\mathfrak{z}(g)$ behaves as $1/g$ near zero and tends to $\lambda_{\min}$ and $\lambda_{\max}$ as $g$
goes to minus or plus infinity respectively.


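$$g_N^{\mathbf{A}}(z) = \frac{1}{z} + O\left(\frac{1}{z^2}\right) \quad \text{when } |z| \to \infty. \tag{10.47}$$

For large $|z|$, $g_N^{\mathbf{A}}(z)$ is thus invertible and its inverse behaves as

$$\mathfrak{z}(g) = \frac{1}{g} + \text{regular terms} \quad \text{when } |g| \to 0. \tag{10.48}$$


If we consider values of $g_N^{\mathbf{A}}(z)$ for $z > \lambda_{\max}$, we realize that the function takes all possible
positive values once and only once, from the extremely large (near $z = \lambda_{\max}$) to almost zero
(when $z \to \infty$). Similarly, all possible negative values are attained when $z \in (-\infty, \lambda_{\min})$
(see Fig. 10.2 left). We conclude that the inverse function $\mathfrak{z}(g)$ exists for all non-zero values
of $g$. The behavior of $g_N^{\mathbf{A}}(z)$ near $\lambda_{\min}$ and $\lambda_{\max}$ gives us the asymptotes


$$\lim_{g\to -\infty} \mathfrak{z}(g) = \lambda_{\min} \quad\text{and}\quad \lim_{g\to\infty} \mathfrak{z}(g) = \lambda_{\max}. \tag{10.49}$$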
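The inverse of the discrete Stieltjes transform is easy to explore numerically. The following sketch assumes a hypothetical helper `inverse_g` based on bisection, and a 5 × 5 Wigner-like matrix as in Fig. 10.2; none of these names come from the text. It returns the branch to the right of the largest eigenvalue for positive g and the branch to the left of the smallest eigenvalue for negative g.

```python
import numpy as np

def g_N(z, lam):
    """Discrete Stieltjes transform of Eq. (10.46)."""
    return np.mean(1.0 / (z - lam))

def inverse_g(g, lam, iters=200):
    """Return z(g): z > lam_max for g > 0 and z < lam_min for g < 0."""
    if g == 0:
        raise ValueError("z(g) has a 1/g pole at g = 0")
    if g > 0:
        lo, hi = lam.max() + 1e-12, lam.max() + 1.0 / g + 1.0
    else:
        lo, hi = lam.min() + 1.0 / g - 1.0, lam.min() - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # g_N decreases on each branch: a value above g means z is too small.
        if g_N(mid, lam) > g:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
x = rng.standard_normal((5, 5))
lam = np.linalg.eigvalsh((x + x.T) / np.sqrt(2 * 5))   # N = 5 test matrix

for g in (-3.0, -0.2, 0.2, 3.0):
    print(g, inverse_g(g, lam))
```

For very large |g| the returned value approaches the extreme eigenvalues, in line with the asymptotes of Eq. (10.49).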


10.4.2 Limiting Stieltjes Transform 


Let us now discuss the inverse function of the limiting Stieltjes transform $g(z)$. The limiting
Stieltjes transform satisfies Eq. (2.41), which we recall here:


$$g(z) = \int_{\operatorname{supp}\{\rho\}} \frac{\rho(x)\,\mathrm{d}x}{z - x}, \tag{10.50}$$

where $\rho(\lambda)$ is the limiting spectral distribution and may contain Dirac deltas. We denote by
$\lambda_\pm$ the edges of the support of $\rho$. We have that for $z > \lambda_+$, $g(z)$ is a positive, monotonically
decreasing function of $z$. Similarly for $z < \lambda_-$, $g(z)$ is a negative, monotonically decreasing
function of $z$. From the normalization of $\rho(\lambda)$, we again find that


$$g(z) = \frac{1}{z} + O\left(\frac{1}{z^2}\right) \quad \text{when } |z| \to \infty. \tag{10.51}$$


Using the same arguments as for the discrete Stieltjes transform, we have that the inverse
function $\mathfrak{z}(g)$ exists for small arguments and behaves as


$$\mathfrak{z}(g) = \frac{1}{g} + \text{regular terms} \quad \text{when } |g| \to 0. \tag{10.52}$$


The behavior of $g(z)$ at $\lambda_\pm$ can be different from that of $g_N(z)$ at its extreme eigenvalues.
The points $\lambda_\pm$ are singular points of $g(z)$. If the density near $\lambda_+$ goes to zero as
$\rho(\lambda) \sim (\lambda_+ - \lambda)^\theta$ for some $\theta > 0$ (typically $\theta = 1/2$) then the integral (10.50) converges
at $z = \lambda_+$ and $g_+ := g(\lambda_+)$ is a finite number. For $z < \lambda_+$ the function $g(z)$ has a branch
cut and is ill defined for $z$ on the real axis. The point $z = \lambda_+$ is a branch point of
$g(z)$. The function is clearly no longer invertible for $z < \lambda_+$. Similarly, if $\rho(\lambda)$ grows as a
positive power near $\lambda_-$, then $g(z)$ is invertible up to the point $g_- := g(\lambda_-)$.

If the density $\rho(\lambda)$ does not go to zero at one of its edges (or if it has a Dirac delta), the
function $g(z)$ diverges at that edge. We may still define $g_\pm = \lim_{z\to\lambda_\pm} g(z)$ if we allow $g_\pm$
to be plus or minus infinity.

In all cases, the inverse function $\mathfrak{z}(g)$ exists in the range $g_- \le g \le g_+$, with the property


$$\mathfrak{z}(g_\pm) = \lambda_\pm. \tag{10.53}$$


In the unit Wigner case, we have $\lambda_\pm = \pm 2$ and $g_\pm = \pm 1$, and the inverse function $\mathfrak{z}(g)$
only exists between $-1$ and $1$ (see Fig. 10.3).


10.4.3 The Inverse Stieltjes Transform for Larger Arguments 


In some computations, as in the HCIZ integral, one needs the value of $\mathfrak{z}(g)$ beyond $g_\pm$.
What can we say then? First of all, one should not be fooled by spurious solutions of the
inversion problem. For example in the Wigner case we know that $g(z)$ satisfies


$$g + \frac{1}{g} - z = 0, \tag{10.54}$$

so we would be tempted to write

$$\mathfrak{z}(g) = g + \frac{1}{g} \tag{10.55}$$

for all $g$. But this is wrong as $g + 1/g$ is not the inverse of $g(z)$ for $|g| > 1$ (Fig. 10.3).
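The spurious branch is easy to exhibit numerically. In the sketch below, `g_wigner` is the decaying root of Eq. (10.54) for real z > 2 and `z_naive` is the tempting formula (10.55); both names are introduced only for this illustration.

```python
import numpy as np

def g_wigner(z):
    """Stieltjes transform of the unit Wigner law for real z > 2."""
    return 0.5 * (z - np.sqrt(z * z - 4.0))

def z_naive(g):
    """The candidate inverse g + 1/g of Eq. (10.55)."""
    return g + 1.0 / g

# For |g| < 1 the candidate is a genuine inverse ...
print(g_wigner(z_naive(0.5)))   # gives back 0.5
# ... but for |g| > 1 it lands on the other root, 1/g, of Eq. (10.54).
print(g_wigner(z_naive(2.0)))   # gives 0.5, not 2.0
```

Both g = 0.5 and g = 2 map to z = 2.5, so g + 1/g cannot be inverted on both ranges at once.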


The correct way to extend $\mathfrak{z}(g)$ beyond $g_\pm$ is to realize that in most computations, we
use $g(z)$ as an approximation for $g_N(z)$ for very large $N$. For $z > \lambda_+$ the function $g_N(z)$
Figure 10.3 (left) The limiting function $g(z)$ for a Wigner matrix, a typical density that vanishes at
its edges. The function is plotted against a real argument. In the white part of the graph, the function
is ill defined and it is shown here for a small negative imaginary part of its argument. In the gray
part ($z < \lambda_-$ and $z > \lambda_+$) the function is well defined, real and monotonic. It is therefore invertible.
(right) The inverse function $\mathfrak{z}(g)$ only exists for $g_- < g < g_+$ and has a $1/g$ singularity at zero. The
dashed lines show the extension of $\mathfrak{z}(g)$ to all values of $g$ that are natural when we think of $g(z)$ as
the limit of $g_N(z)$ with maximal and minimal eigenvalues $\lambda_\pm$. The dotted lines indicate the wrong
branch of the solution of $\mathfrak{z}(g) = g + 1/g$.


converges to $g(z)$ and this approximation can be made arbitrarily good for large enough
$N$. On the other hand we know that on the support of $\rho$, $g_N(z)$ does not converge to $g(z)$.
The former has a series of simple poles at random locations, while the latter typically has
a branch cut.

At large but finite $N$, there will be a maximum eigenvalue $\lambda_{\max}$. This eigenvalue is
random but to dominant order in $N$ it converges to $\lambda_+$, the edge of the spectrum. For $z$
above but very close to $\lambda_+$ we should think of $g_N(z)$ as


$$g_N(z) \approx g(z) + \frac{1}{N}\frac{1}{z - \lambda_{\max}} \approx g(z) + \frac{1}{N}\frac{1}{z - \lambda_+}. \tag{10.56}$$


Because $1/N$ goes to zero, the correction above does not change the limiting value of $g(z)$
at any finite distance from $\lambda_+$. On the other hand, this correction does change the behavior
of the inverse function $\mathfrak{z}(g)$. We now have


$$\lim_{z\to\lambda_+} g_N(z) \to \infty \quad\text{and}\quad \mathfrak{z}(g) = \lambda_+ \ \text{ for } g > g_+. \tag{10.57}$$


For negative $z$ and negative $g$, the same argument follows near $\lambda_-$. We realize that, while
the limiting Stieltjes transform $g(z)$ loses all information about individual eigenvalues,
its inverse function $\mathfrak{z}(g)$, or really the large $N$ limit of the inverse of the function $g_N(z)$,
retains information about the smallest and largest eigenvalues. In Chapter 14 we will study
random matrices where a finite number of eigenvalues lie outside the support of $\rho$. In the
large $N$ limit, these eigenvalues do not change the density or $g(z)$ but they do show up in
the inverse function $\mathfrak{z}(g)$.
Let us define $\mathfrak{z}_{\text{bulk}}(g)$, the inverse function of $g(z)$ without considering extreme eigenvalues
or outliers. In the presence of outliers we have $\lambda_{\max} > \lambda_+$ and $g_{\max} := g(\lambda_{\max}) <
g(\lambda_+)$ and similarly for $g_{\min}$. With arguments similar to those above we find the following
result for the limit of the inverse of $g_N(z)$:


$$\mathfrak{z}(g) = \begin{cases} \lambda_{\min} & \text{for } g \le g_{\min}, \\ \mathfrak{z}_{\text{bulk}}(g) & \text{for } g_{\min} < g < g_{\max}, \\ \lambda_{\max} & \text{for } g \ge g_{\max}. \end{cases} \tag{10.58}$$


In the absence of outliers the result still applies with max and min (extreme eigenvalues)
replaced by $+$ and $-$ (edge of the spectrum), respectively.


10.4.4 Large t Behavior of $I_t$


Now that we understand the behavior of $\mathfrak{z}(g)$ for larger arguments we can go back to
our study of the rank-1 HCIZ integral. There is indeed an apparent paradox in the result
of our computation of $I_t(\mathbf{B})$. For a given matrix $\mathbf{B}$ there are two immediate bounds to
$I_t(\mathbf{B}) = Z_t(\mathbf{B})/Z_t(0)$:


$$\exp\left(\frac{N t \lambda_{\min}}{2}\right) \le I_t(\mathbf{B}) \le \exp\left(\frac{N t \lambda_{\max}}{2}\right), \tag{10.59}$$


where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of $\mathbf{B}$, respectively. Focusing
on the upper bound, we have


$$H_{\mathbf{B}}(t) \le t\lambda_{\max}. \tag{10.60}$$

On the other hand, the anti-derivative of the R-transform for a unit Wigner matrix reads

$$R_{\mathbf{W}}(t) = t \;\Rightarrow\; H_{\mathbf{W}}(t) = \frac{t^2}{2}, \tag{10.61}$$


whereas $\lambda_{\max} \to \lambda_+ = 2$. One might thus think that the quadratic behavior of $H_{\mathbf{W}}(t)$
violates the bound (10.60) for $t > 4$. We should, however, remember that Eq. (10.42) is in
fact only valid for $t < g_+$, the value at which $g(z)$ ceases to be invertible. In the absence
of outliers, $g_+ = g(\lambda_+)$. For a unit Wigner this point is $g_+ = g(2) = 1$; the bound is
not violated. For $t > g_+$, one can still compute $H_{\mathbf{B}}(t)$ but the result depends explicitly
on $\lambda_{\max}$.

Now that we understand the behavior of $\mathfrak{z}(g)$ for larger arguments, including in the
presence of outliers, we can extend our result for $H_{\mathbf{B}}(t)$ to large $t$. We just need to insert
Eq. (10.58) into Eq. (10.41):


$$\frac{\mathrm{d}H_{\mathbf{B}}(t)}{\mathrm{d}t} = \begin{cases} R_{\mathbf{B}}(t) & \text{for } t \le g_{\max} := g(\lambda_{\max}), \\ \lambda_{\max} - 1/t & \text{for } t > g_{\max}, \end{cases} \tag{10.62}$$


where the largest eigenvalue $\lambda_{\max}$ can be either the edge of the spectrum $\lambda_+$ or a true
outlier. We will show in Section 13.3 how this result can also be derived for Wigner
matrices using the replica method.
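For the unit Wigner case without outliers, Eq. (10.62) can be integrated in closed form, which makes the resolution of the paradox explicit. The sketch below assumes $\lambda_{\max} = \lambda_+ = 2$ and $g_{\max} = g_+ = 1$; the function names are chosen for this illustration. Integrating $2 - 1/x$ from 1 to $t$ gives $H(t) = 1/2 + 2(t - 1) - \log t$ for $t > 1$.

```python
import numpy as np

LAMBDA_PLUS = 2.0   # edge of the unit Wigner spectrum
G_PLUS = 1.0        # g(lambda_+) for the unit Wigner law

def dH_wigner(t):
    """Right-hand side of Eq. (10.62) for a unit Wigner matrix (t > 0)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= G_PLUS, t, LAMBDA_PLUS - 1.0 / t)

def H_wigner(t):
    """Integral of dH_wigner from 0 to t (t > 0)."""
    t = np.asarray(t, dtype=float)
    below = 0.5 * t**2
    above = 0.5 + LAMBDA_PLUS * (t - 1.0) - np.log(t)
    return np.where(t <= G_PLUS, below, above)

ts = np.linspace(0.01, 50.0, 2000)
print(np.all(H_wigner(ts) <= LAMBDA_PLUS * ts))   # the bound (10.60) holds
print(0.5 * 5.0**2 > LAMBDA_PLUS * 5.0)           # naive t^2/2 breaks it at t = 5
```

The corrected function grows linearly with slope approaching $\lambda_+$, so it stays below the bound for all $t$.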


10.5 The Full-Rank HCIZ Integral 


We have defined in Eq. (10.23) the HCIZ integral as a generalization of the Fourier transform
for matrices, and have seen how to evaluate this integral in the limit $N \to \infty$ when one
matrix is of low rank. A generalized HCIZ integral $I_\beta(\mathbf{A},\mathbf{B})$ can be defined as

$$I_\beta(\mathbf{A},\mathbf{B}) = \int_{G(N)} \mathrm{d}\mathbf{U}\; e^{\frac{\beta N}{2}\operatorname{Tr}\mathbf{A}\mathbf{U}\mathbf{B}\mathbf{U}^\dagger}, \tag{10.63}$$

where the integral is over the (flat) Haar measure of the compact group $\mathbf{U} \in G(N) =
O(N)$, $U(N)$ or $Sp(N)$ in $N$ dimensions and $\mathbf{A},\mathbf{B}$ are arbitrary $N \times N$ symmetric (resp.
Hermitian or symplectic) matrices, with, correspondingly, $\beta = 1, 2$ or $4$. Note that by
construction $I_\beta(\mathbf{A},\mathbf{B})$ can only depend on the eigenvalues of $\mathbf{A}$ and $\mathbf{B}$, since any change of
basis on $\mathbf{B}$ (say) can be reabsorbed in $\mathbf{U}$, over which we integrate. Note also that the Haar
measure is normalized, i.e. $\int_{G(N)} \mathrm{d}\mathbf{U} = 1$.

Interestingly, it turns out that in the unitary case $G(N) = U(N)$ ($\beta = 2$), the HCIZ
integral can be expressed exactly, for all $N$, as the ratio of determinants that depend on
$\mathbf{A},\mathbf{B}$ and additional $N$-dependent prefactors. This is the celebrated Harish-Chandra–Itzykson–Zuber
result, which cannot be absent from a book on random matrices:


$$I_2(\mathbf{A},\mathbf{B}) = \frac{c_N}{N^{(N^2-N)/2}}\, \frac{\det\left(e^{N\nu_i\lambda_j}\right)}{\Delta(\mathbf{A})\Delta(\mathbf{B})}, \tag{10.64}$$


with $\{\nu_i\}$, $\{\lambda_i\}$ the eigenvalues of $\mathbf{A}$ and $\mathbf{B}$, $\Delta(\mathbf{A})$, $\Delta(\mathbf{B})$ the Vandermonde determinants
of $\mathbf{A}$ and $\mathbf{B}$, and $c_N = \prod_{\ell=1}^{N-1} \ell!$.

Although this result is fully explicit for $\beta = 2$, the expression in terms of determinants is
highly non-trivial and quite tricky. For example, the expression becomes degenerate (0/0)
whenever two eigenvalues of $\mathbf{A}$ (or $\mathbf{B}$) coincide. Also, as is well known, determinants
contain $N!$ terms of alternating signs, which makes their order of magnitude very hard
to estimate a priori. The aim of this technical section is to discuss how the HCIZ result can
be obtained using the Karlin–McGregor equation (9.69). We then use the mapping to the
Dyson Brownian motion to derive a large $N$ approximation for the full-rank HCIZ integral
in the general case.
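Equation (10.64) can be checked directly for $N = 2$. The sketch below relies on one outside fact, stated here as an assumption of the example: for a Haar distributed $2 \times 2$ unitary matrix, $u = |U_{11}|^2$ is uniform on $[0, 1]$, with $|U_{22}|^2 = u$ and $|U_{12}|^2 = |U_{21}|^2 = 1 - u$. The group integral then reduces to a one-dimensional average, evaluated here with a midpoint rule. The function names and test eigenvalues are illustrative.

```python
import numpy as np
from math import factorial

def vandermonde(x):
    """Delta(x) = prod_{i<j} (x_j - x_i), same convention for A and B."""
    n = len(x)
    return np.prod([x[j] - x[i] for i in range(n) for j in range(i + 1, n)])

def hciz_unitary(nu, lam):
    """Right-hand side of Eq. (10.64) for distinct eigenvalues nu and lam."""
    nu, lam = np.asarray(nu, float), np.asarray(lam, float)
    n = len(nu)
    c_n = np.prod([float(factorial(l)) for l in range(1, n)])
    det = np.linalg.det(np.exp(n * np.outer(nu, lam)))
    return c_n / n ** ((n * n - n) / 2) * det / (vandermonde(nu) * vandermonde(lam))

def hciz_n2_direct(nu, lam, m=200001):
    """Average of exp(2 Tr A U B U^dagger) over U(2), using u = |U_11|^2."""
    u = (np.arange(m) + 0.5) / m      # midpoint grid on [0, 1]
    tr = (u * (nu[0] * lam[0] + nu[1] * lam[1])
          + (1.0 - u) * (nu[0] * lam[1] + nu[1] * lam[0]))
    return np.mean(np.exp(2.0 * tr))

nu, lam = (0.3, -0.2), (0.5, 0.1)     # hypothetical eigenvalues of A and B
print(hciz_unitary(nu, lam), hciz_n2_direct(nu, lam))
```

The two numbers agree because, for $N = 2$, both sides reduce to $(e^a - e^b)/(a - b)$ with $a = 2(\nu_1\lambda_1 + \nu_2\lambda_2)$ and $b = 2(\nu_1\lambda_2 + \nu_2\lambda_1)$.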


10.5.1 HCIZ and Karlin–McGregor


In order to understand the origin of Eq. (10.64), the basic idea is to interpret the HCIZ
integrand in the unitary case, $\exp[N \operatorname{Tr}\mathbf{A}\mathbf{U}\mathbf{B}\mathbf{U}^\dagger]$, as a part of the diffusion propagator in
the space of Hermitian matrices, and use the Karlin–McGregor formula.

Indeed, adding to $\mathbf{A}$ a sequence of infinitesimal random Gaussian Hermitian matrices
of variance $\mathrm{d}t/N$, the probability to end up with matrix $\mathbf{B}$ in a time $t = 1$ is given by


$$P(\mathbf{B}|\mathbf{A}) \propto N^{N^2/2}\, e^{-\frac{N}{2}\operatorname{Tr}(\mathbf{B}-\mathbf{A})^2}, \tag{10.65}$$


where we drop an overall normalization constant in our attempt to understand the structure
of Eq. (10.64). The corresponding eigenvalues follow a Dyson Brownian motion, namely


$$\mathrm{d}x_i = \sqrt{\frac{1}{N}}\,\mathrm{d}B_i + \frac{1}{N}\sum_{j\neq i} \frac{\mathrm{d}t}{x_i - x_j}, \tag{10.66}$$


with $x_i(t = 0) = \nu_i$ and $x_i(t = 1) = \lambda_i$. Now, for $\beta = 2$ we can use the Karlin–McGregor
equation (9.69) to derive the conditional distribution of the $\{\lambda_i\}$, given by

$$P(\{\lambda_i\}|\{\nu_i\}) = \frac{\Delta(\mathbf{B})}{\Delta(\mathbf{A})}\, P(\vec{\lambda}, t = 1|\vec{\nu}), \tag{10.67}$$


where $P(\vec{\lambda}, t = 1|\vec{\nu})$ is given by a determinant, Eq. (9.66). With the present normalization,
this determinant reads


$$P(\vec{\lambda}, t = 1|\vec{\nu}) = \left(\frac{N}{2\pi}\right)^{N/2} e^{-\frac{N}{2}\left(\operatorname{Tr}\mathbf{A}^2 + \operatorname{Tr}\mathbf{B}^2\right)} \det\left(e^{N\nu_i\lambda_j}\right). \tag{10.68}$$


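Now, the distribution of eigenvalues of $\mathbf{B}$ can be computed from Eq. (10.65). First we
make $P(\mathbf{B}|\mathbf{A})$ unitary-invariant by integrating over $U(N)$:


$$\begin{aligned}
P(\mathbf{B}|\mathbf{A}) \to \widetilde{P}(\mathbf{B}|\mathbf{A}) &= N^{N^2/2} \int_{U(N)} \frac{\mathrm{d}\mathbf{U}}{\Omega_N}\, e^{-\frac{N}{2}\operatorname{Tr}\left(\mathbf{U}\mathbf{B}\mathbf{U}^\dagger - \mathbf{A}\right)^2} \\
&= N^{N^2/2}\, e^{-\frac{N}{2}\left[\operatorname{Tr}\mathbf{A}^2 + \operatorname{Tr}\mathbf{B}^2\right]}\, I_2(\mathbf{A},\mathbf{B}),
\end{aligned} \tag{10.69}$$


where $\Omega_N = \int_{U(N)} \mathrm{d}\mathbf{U}$ is the “volume” of the unitary group $U(N)$. This new measure,
by construction, only depends on $\{\lambda_i\}$, the eigenvalues of $\mathbf{B}$. Changing variables from $\mathbf{B}$
to $\{\lambda_i\}$ introduces a Jacobian, which in the unitary case is the square of the Vandermonde
determinant of $\mathbf{B}$, $\Delta^2(\mathbf{B})$. We thus find a second expression for the distribution of the $\{\lambda_i\}$:


$$P(\{\lambda_i\}|\{\nu_i\}) \propto N^{N^2/2}\, \Delta^2(\mathbf{B})\, e^{-\frac{N}{2}\left(\operatorname{Tr}\mathbf{A}^2 + \operatorname{Tr}\mathbf{B}^2\right)}\, I_2(\mathbf{A},\mathbf{B}). \tag{10.70}$$


Comparing with Eqs. (10.67) and (10.68) we thus find


$$I_2(\mathbf{A},\mathbf{B}) \propto N^{(N-N^2)/2}\, \frac{\det\left(e^{N\nu_i\lambda_j}\right)}{\Delta(\mathbf{A})\Delta(\mathbf{B})}, \tag{10.71}$$


which coincides with Eq. (10.64), up to an overall constant $c_N$ which can be obtained
by taking the limit $\mathbf{A} = \mathbf{1}$, i.e. when all the eigenvalues of $\mathbf{A}$ are equal to 1. The limit is
singular but one can deal with it in a way similar to the one used by Brézin and Hikami
to go from Eq. (6.65) to (6.67). In this limit, the right hand side of Eq. (10.71) reads
$\exp(N\operatorname{Tr}\mathbf{B})/c_N$, while the left hand side is trivially equal to $\exp(N\operatorname{Tr}\mathbf{B})$. Hence a factor
$c_N$ is indeed missing in the right hand side of Eq. (10.71).

Equation (10.64) can also be used to obtain an exact formula for the rank-1 HCIZ
integral (when $\beta = 2$). The trick is to have one of the eigenvalues $\nu_i$ of $\mathbf{A}$ equal to some
non-zero number $t$ and let the $N - 1$ others go to zero. The limit can again be dealt with
in the same way as Eq. (6.65). One finally finds


$$I_2(t,\mathbf{B}) = \frac{(N-1)!}{(Nt)^{N-1}} \sum_{j=1}^{N} \frac{e^{Nt\lambda_j}}{\prod_{k\neq j}(\lambda_j - \lambda_k)}. \tag{10.72}$$

The above formula may look singular at $t = 0$, but we have $\lim_{t\to 0} I_2(t,\mathbf{B}) = 1$ as
expected.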
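The rank-1 formula (10.72) is simple enough to evaluate directly. The sketch below uses an illustrative function name and test eigenvalues, and requires distinct eigenvalues. It checks the $N = 2$ closed form $(e^{2t\lambda_1} - e^{2t\lambda_2})/(2t(\lambda_1 - \lambda_2))$, the small-$t$ limit, and the elementary bounds $e^{Nt\lambda_{\min}} \le I_2(t,\mathbf{B}) \le e^{Nt\lambda_{\max}}$ for $t > 0$.

```python
import numpy as np
from math import factorial

def hciz_rank1_unitary(t, lam):
    """Evaluate Eq. (10.72) for t != 0 and distinct eigenvalues lam."""
    lam = np.asarray(lam, dtype=float)
    n = len(lam)
    total = 0.0
    for j in range(n):
        # Product over k != j of (lam_j - lam_k).
        denom = np.prod(np.delete(lam[j] - lam, j))
        total += np.exp(n * t * lam[j]) / denom
    return factorial(n - 1) / (n * t) ** (n - 1) * total

lam3 = [0.5, 0.1, -0.3]                      # hypothetical eigenvalues of B
print(hciz_rank1_unitary(1e-3, lam3))        # close to 1 for small t
print(hciz_rank1_unitary(0.5, lam3))         # between e^{-0.45} and e^{0.75}
```

The sum in Eq. (10.72) is a divided difference of $x \mapsto e^{Ntx}$, which explains both the apparent singularity and the finite limit at $t = 0$. For nearly degenerate eigenvalues the direct evaluation suffers from cancellation.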


10.5.2 HCIZ at Large N: The Euler–Matytsin Equations


We now explain how $I_2(\mathbf{A},\mathbf{B})$ can be estimated for large matrix size, using a Dyson
Brownian representation of $P(\mathbf{B}|\mathbf{A})$, Eq. (10.66). In terms of these interacting Brownian
motions, the question is how to estimate the probability that the $x_i(t)$ start at $x_i(t = 0) =
\nu_i$ and end at $x_i(t = 1) = \lambda_i$, when their trajectories are determined by Eq. (10.66), which
we rewrite as


$$\mathrm{d}x_i = \sqrt{\frac{1}{N}}\,\mathrm{d}B_i - \partial_{x_i} V\,\mathrm{d}t, \qquad V(\{x_i\}) := -\frac{1}{N}\sum_{i<j} \ln|x_i - x_j|. \tag{10.73}$$

The probability of a given trajectory for the $N$ Brownian motions between time $t = 0$ and
time $t = 1$ is then given by⁴


$$P(\{x_i(t)\}) = Z^{-1} \exp\left[-\frac{N}{2}\int_0^1 \mathrm{d}t \sum_i \left(\dot{x}_i + \partial_{x_i} V\right)^2\right] := Z^{-1} e^{-N^2 S}, \tag{10.74}$$


where $Z$ is a normalization factor that we will not need explicitly. Expanding the square
as $\dot{x}_i^2 + 2\,\partial_{x_i}V\,\dot{x}_i + (\partial_{x_i}V)^2$, one can decompose $S = S_1 + S_2$ into a total derivative term
equal, in the continuum limit, to boundary terms, i.e.


$$S_1 = -\frac{1}{2}\left[\iint \mathrm{d}x\,\mathrm{d}y\,\rho_C(x)\rho_C(y) \ln|x - y|\right]_{C=A}^{C=B}, \tag{10.75}$$

and

$$S_2 := \frac{1}{2N}\int_0^1 \mathrm{d}t \sum_i \left[\dot{x}_i^2 + \left(\partial_{x_i} V\right)^2\right]. \tag{10.76}$$


We now look for the “instanton” trajectory that contributes most to the probability $P$ for
large $N$, in other words the trajectory that minimizes $S_2$. This extremal trajectory is such
that the functional derivative of $S_2$ with respect to all $x_i(t)$ is zero:


$$-2\,\frac{\mathrm{d}^2 x_i}{\mathrm{d}t^2} + 2\sum_{\ell=1}^{N} \partial^2_{x_i,x_\ell} V\, \partial_{x_\ell} V = 0, \tag{10.77}$$


which leads, after a few algebraic manipulations, to


$$\frac{\mathrm{d}^2 x_i}{\mathrm{d}t^2} = -\frac{2}{N^2}\sum_{\ell\neq i} \frac{1}{(x_i - x_\ell)^3}. \tag{10.78}$$


This can be interpreted as the motion of unit mass particles, accelerated by an attractive
force that derives from an effective two-body potential $\phi(r) = -(Nr)^{-2}$. The hydrodynamical
description of such a fluid, justified when $N \to \infty$, is given by the Euler equations
for the density field $\rho(x,t)$ and the velocity field $v(x,t)$:


$$\partial_t \rho(x,t) + \partial_x\left[\rho(x,t)v(x,t)\right] = 0 \tag{10.79}$$


and


$$\partial_t v(x,t) + v(x,t)\,\partial_x v(x,t) = -\frac{1}{\rho(x,t)}\,\partial_x \Pi(x,t), \tag{10.80}$$


where $\Pi(x,t)$ is the pressure field, which reads, from the “virial” formula for an interacting
fluid at temperature $T$,⁵


$$\Pi = \rho T - \frac{1}{2}\rho \sum_{\ell\neq i} |x_i - x_\ell|\,\phi'(x_i - x_\ell) \approx -\frac{\rho}{N^2}\sum_{\ell\neq i} \frac{1}{(x_i - x_\ell)^2}, \tag{10.81}$$


because the fluid describing the instanton is at zero temperature, $T = 0$. Now, writing
$x_i - x_\ell \approx (i - \ell)/(N\rho)$ and $\sum_{n=1}^{\infty} n^{-2} = \pi^2/6$, one finally finds


$$\Pi(x,t) = -\frac{\pi^2}{3}\,\rho(x,t)^3. \tag{10.82}$$


4 We neglect here a Jacobian which is not relevant to compute the leading term of $I_2(\mathbf{A},\mathbf{B})$ in the large $N$ limit.
5 See e.g. Le Bellac et al. [2004], p. 138.


Equations (10.79) and (10.80) for $\rho$ and $v$ with $\Pi$ given by (10.82) are called the Euler–Matytsin
equations. They should be solved with the following boundary conditions:


$$\rho(x,t = 0) = \rho_A(x); \qquad \rho(x,t = 1) = \rho_B(x); \tag{10.83}$$


the velocity field $v(x,t = 0)$ should be chosen such that these boundary conditions are
fulfilled.

Expressing $S_2$ in terms of the solution of the Euler–Matytsin equations gives, in the
continuum limit,


$$S_2(\mathbf{A},\mathbf{B}) \approx \frac{1}{2}\int_0^1 \mathrm{d}t \int \mathrm{d}x\, \rho(x,t)\left[v^2(x,t) + \frac{\pi^2}{3}\rho^2(x,t)\right]. \tag{10.84}$$


Hence, the probability $P(\{\lambda_i\}|\{\nu_i\})$ to observe the set of eigenvalues $\{\lambda_i\}$ of $\mathbf{B}$ for a
given set of eigenvalues $\{\nu_i\}$ for $\mathbf{A}$ is, in the large $N$ limit, proportional to $\exp[-N^2(S_1 +
S_2)]$. Comparing with Eq. (10.70), we get as a final expression for $F_2(\mathbf{A},\mathbf{B}) :=
-\lim_{N\to\infty} N^{-2} \ln I_2(\mathbf{A},\mathbf{B})$:


$$\begin{aligned}
F_2(\mathbf{A},\mathbf{B}) = {} & \frac{3}{4} + S_2(\mathbf{A},\mathbf{B}) - \frac{1}{2}\int \mathrm{d}x\, x^2\left(\rho_A(x) + \rho_B(x)\right) \\
& + \frac{1}{2}\iint \mathrm{d}x\,\mathrm{d}y \left[\rho_A(x)\rho_A(y) + \rho_B(x)\rho_B(y)\right] \ln|x - y|.
\end{aligned} \tag{10.85}$$


This result was first derived in Matytsin [1994], and proven rigorously in Guionnet and
Zeitouni [2002]. Note that this expression is symmetric in $\mathbf{A}$, $\mathbf{B}$, as it should be, because
the solution of the Euler–Matytsin equations for the time reversed path from $\rho_B$ to $\rho_A$ is
simply obtained from $\rho(x,t) \to \rho(x,1 - t)$ and $v(x,t) \to -v(x,1 - t)$, which leaves
$S_2(\mathbf{A},\mathbf{B})$ unchanged.

The whole calculation above can be repeated for the $\beta = 1$ (orthogonal group) or
$\beta = 4$ (symplectic group) with the final (simple) result $F_\beta(\mathbf{A},\mathbf{B}) = \beta F_2(\mathbf{A},\mathbf{B})/2$.
Free Probabilities


In the previous chapter we saw how to compute the spectrum of the sum of two large 
random matrices, first when one of them is a Wigner and later when one is “rotationally 
invariant” with respect to the other. In this chapter, we would like to formalize the notion 
of relative rotational invariance, which leads to the abstract concept of freeness. 

The idea is as follows. In standard probability theory, one can work abstractly by defining
expectation values (moments) of random variables. The concept of independence is
then equivalent to the factorization of moments (e.g. $\mathbb{E}[A^2B^2] = \mathbb{E}[A^2]\,\mathbb{E}[B^2]$ when $A$ and
$B$ are independent).

However, random matrices do not commute in general and the concept of factorization
of moments is not powerful enough to deal with non-commuting random objects. Following
von Neumann, Voiculescu extended the concept of independence to non-commuting
objects and called this property freeness. He then showed how to characterize the sum and
the product of free variables. It was later realized that large rotationally invariant matrices
provide an explicit example of (asymptotically) free random variables. In other words, free
probabilities gave us very powerful tools to compute sums and products of large random
matrices. We have already encountered the free addition; the free product will allow us to
study sample covariance matrices in the presence of non-trivial true correlations.

This chapter may seem too dry and abstract for someone looking for applications. Bear 
with us, it is in fact not that complicated and we will keep the jargon to a minimum. The 
reward will be one of the most powerful and beautiful recent developments in random 
matrix theory, which we will expand upon in Chapter 12. 


11.1 Algebraic Probabilities: Some Definitions 
The ingredients we will need in this chapter are as follows:¹


• A ring $\mathcal{R}$ of random variables, which can be non-commutative with respect to the multiplication.²
• A field of scalars, which is usually taken to be $\mathbb{C}$. The scalars commute with everything.
• An operation $*$, called involution. For instance, $*$ denotes the conjugate for complex
numbers, the transpose for real matrices, and the conjugate transpose for complex
matrices.
• A positive linear functional $\tau(.)$ ($\mathcal{R} \to \mathbb{C}$) that satisfies $\tau(AB) = \tau(BA)$ for $A,B \in \mathcal{R}$.
By positive we mean $\tau(AA^*)$ is real non-negative. We also require that $\tau$ be faithful,
in the sense that $\tau(AA^*) = 0$ implies $A = 0$. For instance, $\tau$ can be the expectation
operator $\mathbb{E}[.]$ for standard probability theory, or the normalized trace operator $\frac{1}{N}\operatorname{Tr}(.)$
for a ring of matrices, or the combined operation $\frac{1}{N}\mathbb{E}[\operatorname{Tr}(.)]$.


1 In mathematical language, the first three items give a *-algebra, while $\tau$ gives a tracial state on this algebra.
2 Recall that a ring is a set equipped with two binary operations that generalize the arithmetic operations of addition and
multiplication.


We will call the elements in $\mathcal{R}$ the random variables and denote them by capital letters.
For any $A \in \mathcal{R}$ and $k \in \mathbb{N}$, we call $\tau(A^k)$ the $k$th moment of $A$ and we assume in what
follows that $\tau(A^k)$ is finite for all $k$. In particular, we call $\tau(A)$ the mean of $A$ and $\tau(A^2) -
\tau(A)^2$ the variance of $A$. We will say that two elements $A$ and $B$ have the same distribution
if they have the same moments of all orders.³

The ring of variables must have an element called $\mathbf{1}$ such that $A\mathbf{1} = \mathbf{1}A = A$ for every
$A$. It satisfies $\tau(\mathbf{1}) = 1$. We will call $\mathbf{1}$ and its multiples $\alpha\mathbf{1}$ constants. Adding a constant
simply shifts the mean as


$$\tau(A + \alpha\mathbf{1}) = \tau(A) + \alpha. \tag{11.1}$$

11.2 Addition of Commuting Variables 


In this section, we recall some well-known properties of commuting random variables, i.e. 
such that 


$$AB = BA, \qquad \forall A,B \in \mathcal{R}. \tag{11.2}$$


Note that $A$ is not necessarily a real (or complex) number but can be an element of
a more abstract ring. We will say that $A$ and $B$ are independent if $\tau(p(A)q(B)) =
\tau(p(A))\,\tau(q(B))$ for any polynomials $p, q$. This condition is equivalent to the factorization
of moments.

From a scalar $\alpha$ we can build the constant $\alpha\mathbf{1}$ and write $A + \alpha$ to mean $A + \alpha\mathbf{1}$. Constants
of the ring are independent of all other random variables, so if $A$ and $B$ are independent,
$A + \alpha$ and $B$ are also independent. This setting recovers the classical probability theory of
commutative random variables (with finite moments to every order).


11.2.1 Moments


Now let us study the moments of the sum of independent random variables $A + B$. First we
trivially have by linearity


$$\tau(A + B) = \tau(A) + \tau(B). \tag{11.3}$$


3 This is of course not correct for standard commuting random variables: some distributions are not uniquely determined by 
their moments. 




From now on we will assume $\tau(A) = \tau(B) = 0$, i.e. $A$, $B$ have mean zero, unless stated
otherwise. For a non-zero mean variable $A$, we write $\widetilde{A} = A - \tau(A)$ such that $\tau(\widetilde{A}) = 0$.
One can recover the formulas for moments and cumulants of $A$ simply by substituting
$A \to A - \tau(A)$ in all formulas written for zero mean $A$. The procedure is straightforward
but leads to rather cumbersome results.

For the second moment,


$$\begin{aligned}
\tau\left((A + B)^2\right) &= \tau(A^2) + \tau(B^2) + 2\tau(AB) \\
&= \tau(A^2) + \tau(B^2) + 2\tau(A)\tau(B) = \tau(A^2) + \tau(B^2),
\end{aligned} \tag{11.4}$$

i.e. the variance is also additive. For the third moment, we have

$$\tau\left((A + B)^3\right) = \tau(A^3) + \tau(B^3) + 3\tau(A)\tau(B^2) + 3\tau(B)\tau(A^2) = \tau(A^3) + \tau(B^3), \tag{11.5}$$

which is also additive. However, the fourth and higher moments are not additive anymore.
For example we get, expanding $(A + B)^4$,


$$\tau\left((A + B)^4\right) = \tau(A^4) + \tau(B^4) + 6\tau(A^2)\tau(B^2). \tag{11.6}$$


11.2.2 Cumulants 


For zero mean variables the first three moments are additive but not the higher ones.
Nevertheless, certain combinations of higher moments are additive; we call them cumulants
and denote them as $\kappa_n$ for the $n$th cumulant. Note that for a variable with non-zero mean
$A$, the second and third cumulants are the second and third moments of $\widetilde{A} := A - \tau(A)$:


$$\begin{aligned}
\kappa_1(A) &= \tau(A), \\
\kappa_2(A) &= \tau(\widetilde{A}^2) = \tau(A^2) - \tau(A)^2, \\
\kappa_3(A) &= \tau(\widetilde{A}^3) = \tau(A^3) - 3\tau(A^2)\tau(A) + 2\tau(A)^3.
\end{aligned} \tag{11.7}$$


For the fourth cumulant, let us consider for simplicity zero mean variables $A$ and $B$, and
define $\kappa_4$ as


$$\kappa_4(A) := \tau(A^4) - 3\tau(A^2)^2. \tag{11.8}$$

Then we can verify that

$$\begin{aligned}
\kappa_4(A + B) &= \tau\left((A + B)^4\right) - 3\tau\left((A + B)^2\right)^2 \\
&= \tau(A^4) + \tau(B^4) + 6\tau(A^2)\tau(B^2) - 3\left(\tau(A^2) + \tau(B^2)\right)^2 \\
&= \tau(A^4) - 3\tau(A^2)^2 + \tau(B^4) - 3\tau(B^2)^2 = \kappa_4(A) + \kappa_4(B),
\end{aligned} \tag{11.9}$$


which is additive again.
In general, $\tau((A + B)^n)$ will be of the form $\tau(A^n) + \tau(B^n)$ plus some homogeneous
mix of lower order terms. We can then define the $n$th cumulant $\kappa_n$ iteratively such that

$$\kappa_n(A + B) = \kappa_n(A) + \kappa_n(B), \tag{11.10}$$

where

$$\kappa_n(A) = \tau(A^n) + \text{lower order moments}. \tag{11.11}$$

In order to have a compact definition of cumulants, recall that we are looking for quantities
that are additive for independent variables. But we already know that the log-characteristic
function introduced in Eq. (8.4) is additive. In the present context, we define the
characteristic function as⁴

$$
\varphi_A(t) = \tau\big(e^{itA}\big),
\tag{11.12}
$$

where the exponential function is formally defined through its power series:

$$
\tau\big(e^{itA}\big) = \sum_{\ell=0}^{\infty} \frac{(it)^{\ell}}{\ell!}\,\tau(A^{\ell}),
\tag{11.13}
$$


hence the characteristic function is also the moment generating function. Now, from the 
formal definition of the exponential and the factorization of moments one can easily show 
that for independent, commuting A, B, 


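$$
\varphi_{A+B}(t) = \varphi_A(t)\,\varphi_B(t).
\tag{11.14}
$$

Here is an algebraic proof. For each $k$,

$$
\tau\big((A+B)^k\big) = \sum_{i=0}^{k} \binom{k}{i}\,\tau(A^i)\,\tau(B^{k-i}),
\tag{11.15}
$$

with which we get

$$
\begin{aligned}
\varphi_{A+B}(t) &= \sum_{k=0}^{\infty} \frac{(it)^k}{k!}\,\tau\big((A+B)^k\big)
= \sum_{k=0}^{\infty} \sum_{i \le k} \frac{(it)^i\,\tau(A^i)}{i!}\,\frac{(it)^{k-i}\,\tau(B^{k-i})}{(k-i)!} \\
&= \left(\sum_{i} \frac{(it)^i\,\tau(A^i)}{i!}\right)\left(\sum_{j} \frac{(it)^j\,\tau(B^j)}{j!}\right) = \varphi_A(t)\,\varphi_B(t).
\end{aligned}
\tag{11.16}
$$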


We now define $H_A(t) := \log \varphi_A(t)$. Then, for independent, commuting $A$, $B$, we have

$$
H_{A+B}(t) = H_A(t) + H_B(t).
\tag{11.17}
$$

We can expand $H(t)$ as a power series of $t$ and call the corresponding coefficients the
cumulants, i.e.

$$
H_A(t) = \log \tau\big(e^{itA}\big) =: \sum_{n=1}^{\infty} \frac{\kappa_n(A)}{n!}\,(it)^n.
\tag{11.18}
$$


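4 The factor $i$ in the definition is not necessary in this setting as the formal power series of the exponential and the logarithm do
not need to converge. We nevertheless include it by analogy with the Fourier transform.

From the additive property of $H$, the cumulants defined in the above way are automatically
additive. In fact, using the power series for $\log(1+x)$, we have

$$
H_A(t) = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n} \left(\sum_{k=1}^{\infty} \frac{(it)^k}{k!}\,\tau(A^k)\right)^{n} = \sum_{n=1}^{\infty} \frac{\kappa_n(A)}{n!}\,(it)^n.
\tag{11.19}
$$

Matching powers of $(it)$ we obtain an expression for $\kappa_n$ for any $n$. We can work out by hand
the first few cumulants. For example, for $n = 1$, Eq. (11.19) readily yields $\kappa_1(A) = \tau(A)$.
We now assume $A$ has mean zero, i.e. $\tau(A) = 0$. Then

$$
\tau\big(e^{itA}\big) = 1 + \frac{(it)^2}{2}\,\tau(A^2) + \frac{(it)^3}{3!}\,\tau(A^3) + \frac{(it)^4}{4!}\,\tau(A^4) + \cdots,
\tag{11.20}
$$

whereas the first few terms in the expansion of (11.19) are

$$
\begin{aligned}
H_A(t) &= \left(\frac{(it)^2}{2}\,\tau(A^2) + \frac{(it)^3}{3!}\,\tau(A^3) + \frac{(it)^4}{4!}\,\tau(A^4)\right) - \frac{1}{2}\left(\frac{(it)^2}{2}\,\tau(A^2)\right)^2 + \cdots \\
&= \frac{(it)^2}{2}\,\tau(A^2) + \frac{(it)^3}{3!}\,\tau(A^3) + \frac{(it)^4}{4!}\,\big(\tau(A^4) - 3\tau(A^2)^2\big) + \cdots,
\end{aligned}
\tag{11.21}
$$

from which we recover the first four cumulants defined above:

$$
\kappa_1(A) = 0, \quad \kappa_2(A) = \tau(A^2), \quad \kappa_3(A) = \tau(A^3), \quad \kappa_4(A) = \tau(A^4) - 3\tau(A^2)^2.
\tag{11.22}
$$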


The expression for $\kappa_n$ soon becomes very cumbersome for larger $n$. Nevertheless, by
exponentiating Eq. (11.18) and matching with Eq. (11.13), one can extract the following
moment-cumulant relation for commuting variables:

$$
\begin{aligned}
\tau(A^n) &= \sum_{\substack{r_1, r_2, \ldots, r_n \ge 0 \\ r_1 + 2r_2 + \cdots + n r_n = n}} \frac{n!\;\kappa_1^{r_1}\kappa_2^{r_2}\cdots\kappa_n^{r_n}}{(1!)^{r_1}(2!)^{r_2}\cdots(n!)^{r_n}\; r_1!\, r_2! \cdots r_n!} \\
&= \kappa_n + \text{products of lower order terms} + \kappa_1^n.
\end{aligned}
\tag{11.23}
$$

In particular, the scaling properties for the moments and cumulants (see (11.28) below)
are consistent due to the relation $r_1 + 2r_2 + \cdots + n r_n = n$.
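The relation between moments and commutative cumulants can be checked on small examples. The sketch below is only an illustration: the function name `cumulants_from_moments` is hypothetical, and it relies on the standard recursion $m_n = \sum_{k=1}^{n} \binom{n-1}{k-1}\kappa_k m_{n-k}$ (with $m_0 = 1$), which is assumed here as an equivalent rewriting of the moment-cumulant relation rather than taken from the text.

```python
from fractions import Fraction
from math import comb

def cumulants_from_moments(moments):
    """moments[n-1] = tau(A^n) for n = 1..N; returns [kappa_1, ..., kappa_N]."""
    m = [Fraction(1)] + [Fraction(x) for x in moments]  # m[0] = tau(1) = 1
    kappa = [Fraction(0)] * len(m)                      # kappa[0] is unused
    for n in range(1, len(m)):
        # assumed recursion: m_n = sum_{k=1}^{n} C(n-1, k-1) kappa_k m_{n-k},
        # solved here for kappa_n (the k = n term is kappa_n * m_0)
        lower = sum(comb(n - 1, k - 1) * kappa[k] * m[n - k] for k in range(1, n))
        kappa[n] = m[n] - lower
    return kappa[1:]

# Unit Gaussian moments 0, 1, 0, 3, 0, 15: every cumulant beyond the second vanishes.
gaussian = cumulants_from_moments([0, 1, 0, 3, 0, 15])
# A = +1 or -1 with probability 1/2: kappa_4 = tau(A^4) - 3 tau(A^2)^2 = -2.
rademacher = cumulants_from_moments([0, 1, 0, 1])
print(gaussian, rademacher)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that vanishing cumulants come out as exact zeros rather than rounding noise.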


Exercise 11.2.1 Cumulants of a constant
Show that a constant $a\mathbf{1}$ has $\kappa_1 = a$ and $\kappa_n = 0$ for $n \ge 2$. (Hint: compute
$H_{a\mathbf{1}}(t) = \log \tau\big(e^{ita\mathbf{1}}\big)$.)


11.2.3 Scaling of Moments and Cumulants 


Moments and cumulants obey simple transformation rules under scalar addition and
multiplication. For example, when adding a constant to a variable, $\tilde{A} := A + a$, where $\tau(A) = 0$,
we only change the first cumulant:

$$
\kappa_1(\tilde{A}) = a \quad \text{and} \quad \kappa_n(\tilde{A}) = \kappa_n(A) \quad \text{for } n \ge 2.
\tag{11.24}
$$

For the case of multiplication by an arbitrary scalar $\alpha$, by commutativity of scalars and
linearity of $\tau$ we have

$$
\tau\big((\alpha A)^k\big) = \alpha^k\,\tau(A^k).
\tag{11.25}
$$

For the cumulant, we first look at the scaling of the log-characteristic function:

$$
H_{\alpha A}(t) = \log \tau\big(e^{it\alpha A}\big) = H_A(\alpha t).
\tag{11.26}
$$

And by (11.18), we have

$$
H_{\alpha A}(t) = H_A(\alpha t) = \sum_{n=1}^{\infty} \frac{\kappa_n(A)\,\alpha^n}{n!}\,(it)^n.
\tag{11.27}
$$

Thus we have the simple scaling property

$$
\kappa_n(\alpha A) = \alpha^n \kappa_n(A).
\tag{11.28}
$$


11.2.4 Law of Large Numbers and Central Limit Theorem 


Continuing our study of algebraic probabilities, we would like to recover two very important
theorems in probability theory, namely the law of large numbers (LLN) and the central
limit theorem (CLT). The first states that the sample average converges to the mean (a
constant) as the number of observations $N \to \infty$, and the second that a large sum of
properly centered and rescaled random variables converges to a Gaussian.

First we need to define in our context what we mean by a constant and a Gaussian. For
simplicity, we can think of the variables in this section as standard random variables. We
will later introduce non-commuting cumulants. The arguments of this section apply in
the non-commutative case with independence replaced by freeness.

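We have defined the constant variable $A = a\mathbf{1}$, which satisfies

$$
\kappa_1(A) = a, \quad \kappa_\ell(A) = 0, \quad \forall \ell > 1.
\tag{11.29}
$$

Then we define the "Gaussian" random variable as an element $A$ that satisfies

$$
\kappa_2(A) \neq 0, \quad \kappa_\ell(A) = 0, \quad \forall \ell > 2.
\tag{11.30}
$$

Note that this definition (in the scalar case) is equivalent to the standard Gaussian random
variable with density

$$
P_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\tag{11.31}
$$

with $\kappa_1 = \mu$ and $\kappa_2 = \sigma^2$.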


By extension, we call $\kappa_1(A)$ the mean, and $\kappa_2(A)$ the variance. Now we can give a
simple proof for the LLN and CLT within our algebraic setting. Let

$$
S_K := \frac{1}{K}\sum_{i=1}^{K} A_i,
\tag{11.32}
$$

where $A_i$ are $K$ IID variables.⁵ Then by (11.28) and the additive property of cumulants, we
get that

$$
\kappa_\ell(S_K) = \frac{K}{K^{\ell}}\,\kappa_\ell(A) \xrightarrow[K \to \infty]{}
\begin{cases}
\kappa_1(A), & \text{if } \ell = 1, \\
0, & \text{if } \ell > 1.
\end{cases}
\tag{11.33}
$$

In other words, $S_K$ converges to a constant in the sense of cumulants.
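Assume now that $\kappa_1(A) = 0$ and consider

$$
\hat{S}_K := \frac{1}{\sqrt{K}}\sum_{i=1}^{K} A_i.
\tag{11.34}
$$

Then it is easy to see that

$$
\kappa_\ell(\hat{S}_K) = \frac{K}{K^{\ell/2}}\,\kappa_\ell(A) \xrightarrow[K \to \infty]{}
\begin{cases}
0, & \text{if } \ell = 1, \\
\kappa_2(A), & \text{if } \ell = 2, \\
0, & \text{if } \ell > 2.
\end{cases}
\tag{11.35}
$$

In other words, $\hat{S}_K$ converges to a "Gaussian" random variable with variance $\kappa_2(A) = \tau(A^2)$,
in the sense of cumulants.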
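The convergence of cumulants can be made concrete with an exactly solvable case. The sketch below assumes IID random signs $\varepsilon_i = \pm 1$ (an example chosen for illustration, not taken from the text), computes the exact moments of $T = \varepsilon_1 + \cdots + \varepsilon_K$ from the binomial distribution, and rescales the second and fourth cumulants with Eq. (11.28). The helper names `sum_moment` and `normalized_cumulants` are hypothetical.

```python
from fractions import Fraction
from math import comb

def sum_moment(K, n):
    """Exact n-th moment of T = e_1 + ... + e_K for IID signs e_i = +1 or -1."""
    # T = 2j - K when exactly j signs are +1, which has probability C(K, j) / 2^K
    return Fraction(sum(comb(K, j) * (2 * j - K) ** n for j in range(K + 1)), 2 ** K)

def normalized_cumulants(K):
    """kappa_2 and kappa_4 of T / sqrt(K), using Eqs. (11.22) and (11.28)."""
    m2, m4 = sum_moment(K, 2), sum_moment(K, 4)
    kappa2 = m2 / K                        # kappa_2 scales as alpha^2 = 1/K
    kappa4 = (m4 - 3 * m2 ** 2) / K ** 2   # kappa_4 scales as alpha^4 = 1/K^2
    return kappa2, kappa4

for K in (1, 10, 100):
    print(K, normalized_cumulants(K))
```

For a single sign, $\kappa_2 = 1$ and $\kappa_4 = -2$, so the fourth cumulant of the normalized sum is $-2/K$: it vanishes as $K$ grows while the variance stays equal to one.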

In our algebraic probability setting we have made the assumption that the variables we 
consider have finite moments of all orders. This is a very strong assumption. In particular it 
excludes any variable whose probability decays as a power law. If we relaxed this assump- 
tion we would find that some sums of power-law distributed variables converge not to a 
Gaussian but to a Lévy-stable distribution. A similar concept exists in the non-commutative 
case, but it is beyond the scope of this book. 


11.3 Non-Commuting Variables 


We now return to our original goal of developing an extension of standard probabilities for 
non-commuting objects. One of the goals is to generalize the law of addition of independent 
variables. We consider a variable equal to A+ B where A and B are now non-commutative 
objects such as large random matrices. If we compute the first three moments of A + B, 
no particular problems arise thanks to the tracial property of t, and they behave as in the 
commutative case. For example, consider the third moment 


5 IID copies are variables $A_i$ that have exactly the same cumulants and are all independent. We did not define independence for
more than two variables but the factorization of moments can be easily extended to more variables. Note that pairwise
independence is not enough to assure independence as a group. For example if $x_1$, $x_2$ and $x_3$ are IID Gaussians, the Gaussian
variables $x_1$, $x_2$ and $y_3 = \mathrm{sign}(x_1 x_2)|x_3|$ are pairwise independent but not all independent as $E[x_1 x_2 y_3] > 0$ whereas
$E[x_1]E[x_2]E[y_3] = 0$.

$$
\begin{aligned}
\tau\big((A+B)^3\big) = {}& \tau(A^3) + \tau(A^2B) + \tau(ABA) + \tau(BA^2) \\
&+ \tau(AB^2) + \tau(BAB) + \tau(B^2A) + \tau(B^3).
\end{aligned}
\tag{11.36}
$$


But since $\tau(A^2B) = \tau(ABA) = \tau(BA^2)$ (and similarly when $A$ appears once and $B$
twice), the classic independence property

$$
\tau(A^2B) = \tau(A^2)\tau(B) = 0
\tag{11.37}
$$

appears to be sufficient. Things become more interesting for the fourth moment. Indeed,

$$
\begin{aligned}
\tau\big((A+B)^4\big) = {}& \tau(A^4) + \tau(A^3B) + \tau(A^2BA) + \tau(ABA^2) + \tau(BA^3) \\
&+ \tau(A^2B^2) + \tau(ABAB) + \tau(BA^2B) + \tau(AB^2A) + \tau(BABA) \\
&+ \tau(B^2A^2) + \tau(B^3A) + \tau(B^2AB) + \tau(BAB^2) + \tau(AB^3) + \tau(B^4) \\
= {}& \tau(A^4) + 4\tau(A^3B) + 4\tau(A^2B^2) + 2\tau(ABAB) + 4\tau(AB^3) + \tau(B^4),
\end{aligned}
\tag{11.38}
$$

where in the second step we again used the tracial property of $\tau$. For commutative random
variables, independence of $A$, $B$ means $\tau(A^2B^2) = \tau(A^2)\tau(B^2)$, and this is enough to
treat all the terms above. In the non-commutative case, we also need to handle the term
$\tau(ABAB)$. In general $ABAB$ is not equal to $A^2B^2$. "Independence" is therefore not
enough to deal with this term, so we need a new concept. A radical solution would be
to postulate that $\tau(ABAB)$ is zero whenever $\tau(A) = \tau(B) = 0$. As we compute higher
moments of $A + B$ we will encounter more and more complicated similar mixed moments.
The concept of freeness deals with all such terms at once.


11.3.1 Freeness 


Given two random variables $A$, $B$, we say they are free if for any polynomials $p_1, \ldots, p_n$
and $q_1, \ldots, q_n$ such that

$$
\tau(p_k(A)) = 0, \quad \tau(q_k(B)) = 0, \quad \forall k,
\tag{11.39}
$$

we have

$$
\tau\big(p_1(A)\,q_1(B)\,p_2(A)\,q_2(B) \cdots p_n(A)\,q_n(B)\big) = 0.
\tag{11.40}
$$

We will call a polynomial (or a variable) traceless if $\tau(p(A)) = 0$. Note that $a\mathbf{1}$ is free with
respect to any $A \in \mathcal{R}$ because $\tau(p(a\mathbf{1})) = p(a)$, from the definition of $\mathbf{1}$. Hence,

$$
\tau(p(a\mathbf{1})) = 0 \iff p(a\mathbf{1}) = 0.
\tag{11.41}
$$

Moreover, it is easy to see that if $A$, $B$ are free, then $p(A)$, $q(B)$ are free for any polynomials
$p$, $q$. By extension, $F(A)$ and $G(B)$ are also free for any functions $F$ and $G$ defined by their
power series.

Freeness is non-trivial only in the non-commutative case. For the commutative case,
it is easy to check that $A$, $B$ are free if and only if either $A$ or $B$ is a constant. Free random
variables are "maximally" non-commuting, in a sense made more precise for the example
of free random matrices in the next chapter. For example, for free and mean zero variables $A$
and $B$, we have $\tau(ABAB) = 0$ whereas $\tau(A^2B^2) = \tau\big((A^2 - \tau(A^2))B^2\big) + \tau(A^2)\tau(B^2) =
\tau(A^2)\tau(B^2)$.

Assuming $A$, $B$ are free with $\tau(A) = \tau(B) = 0$, we can compute the moments of the
free addition $A + B$. The second moment is easy:

$$
\tau\big((A+B)^2\big) = \tau(A^2) + \tau(B^2) + 2\tau(AB) = \tau(A^2) + \tau(B^2),
\tag{11.42}
$$

because both $\tau(A)$, $\tau(B)$ are zero. For the third and higher moments the trick is, as just
above, to add and subtract quantities such that, in each term, at least one object of the form
$(C - \tau(C))$ is present:

$$
\begin{aligned}
\tau\big((A+B)^3\big) &= \tau(A^3) + \tau(B^3) + 3\tau(A^2B) + 3\tau(AB^2) \\
&= \tau(A^3) + \tau(B^3) + 3\tau\big((A^2 - \tau(A^2))B\big) + 3\tau(A^2)\tau(B) \\
&\quad + 3\tau\big(A(B^2 - \tau(B^2))\big) + 3\tau(A)\tau(B^2) \\
&= \tau(A^3) + \tau(B^3),
\end{aligned}
\tag{11.43}
$$

and for the fourth moment:

$$
\begin{aligned}
\tau\big((A+B)^4\big) &= \tau(A^4) + 4\tau(A^3B) + 4\tau(A^2B^2) + 2\tau(ABAB) + 4\tau(AB^3) + \tau(B^4) \\
&= \tau(A^4) + 4\tau\big((A^3 - \tau(A^3))B\big) + 4\tau(A^3)\tau(B) \\
&\quad + 4\tau\big((A^2 - \tau(A^2))(B^2 - \tau(B^2))\big) + 4\tau(A^2)\tau(B^2) \\
&\quad + 4\tau\big(A(B^3 - \tau(B^3))\big) + 4\tau(A)\tau(B^3) + \tau(B^4) \\
&= \tau(A^4) + \tau(B^4) + 4\tau(A^2)\tau(B^2).
\end{aligned}
\tag{11.44}
$$

In particular, we find

$$
\tau\big((A+B)^4\big) - 2\tau\big((A+B)^2\big)^2 = \tau(A^4) + \tau(B^4) - 2\tau(A^2)^2 - 2\tau(B^2)^2.
\tag{11.45}
$$


11.3.2 Free Cumulants 


Let us define the cumulants of $A$ as

$$
\kappa_1(A) = \tau(A), \quad \kappa_2(A) = \tau(A_0^2), \quad \kappa_3(A) = \tau(A_0^3), \quad \kappa_4(A) = \tau(A_0^4) - 2\tau(A_0^2)^2,
\tag{11.46}
$$

where $A_0$ is a short-hand for $A - \tau(A)\mathbf{1}$. Then these objects are additive for free random
variables. The first three are the same as the commutative ones. But for the fourth cumulant,
the coefficient in front of $\tau(A_0^2)^2$ is now 2 instead of 3. Higher cumulants all differ from
their commutative counterparts.
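The change of coefficient from 3 to 2 has a visible effect on the fourth moment of a sum. The following minimal sketch (the function name `fourth_moment_of_sum` is hypothetical) adds the second and fourth cumulants of two zero-mean variables, using either the commutative or the free definition of $\kappa_4$, and converts back to a fourth moment. It reproduces Eq. (11.6) in the commutative case and Eq. (11.44) in the free case.

```python
def fourth_moment_of_sum(m2_a, m4_a, m2_b, m4_b, free):
    """tau((A+B)^4) for zero-mean A and B, given their second and fourth moments.

    free=True  uses kappa_4 = m4 - 2*m2^2 (free cumulant, additive for free A, B);
    free=False uses kappa_4 = m4 - 3*m2^2 (commutative cumulant, independent A, B).
    """
    c = 2 if free else 3
    kappa2 = m2_a + m2_b                                  # variance is additive
    kappa4 = (m4_a - c * m2_a ** 2) + (m4_b - c * m2_b ** 2)
    return kappa4 + c * kappa2 ** 2                       # back to the moment

# Two variables with tau(A^2) = tau(A^4) = 1 (e.g. spectrum +1/-1 with equal weight).
print(fourth_moment_of_sum(1, 1, 1, 1, free=True))
print(fourth_moment_of_sum(1, 1, 1, 1, free=False))
```

With these inputs the free sum has fourth moment $1 + 1 + 4 = 6$ while the independent commuting sum has $1 + 1 + 6 = 8$.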

As in the commutative case, we can define the $k$th free cumulant iteratively as

$$
\kappa_k(A) = \tau(A^k) + \text{homogeneous products of lower order moments},
\tag{11.47}
$$

such that

$$
\kappa_k(A+B) = \kappa_k(A) + \kappa_k(B), \quad \forall k,
\tag{11.48}
$$

whenever $A$, $B$ are free.
An important example of non-commutative free random variables is two independent
large random matrices where one of them is rotationally invariant (see next chapter).


11.3.3 Additivity of the R-Transform 


In the previous chapter, we saw that the R-transform is additive for large rotationally 
invariant matrices. We will show here that we can define the R-transform in our abstract 
algebraic probability setting and that this R-transform is also additive for free variables. In 
the next chapter, we will dwell on why large rotationally invariant matrices are free. 

First we define the Stieltjes transform as a moment generating function as in (2.22); we
can define $g_A(z)$ for large $z$ as

$$
g_A(z) = \sum_{k=0}^{\infty} \frac{1}{z^{k+1}}\,\tau(A^k).
\tag{11.49}
$$

Then we can also define the R-transform as before:

$$
R_A(g) := \mathfrak{z}_A(g) - \frac{1}{g}
\tag{11.50}
$$

for small enough $g$. Here the inverse function $\mathfrak{z}_A(g)$ is defined as the formal power series
that satisfies $g_A(\mathfrak{z}_A(g)) = g$ to all orders.
We now claim that the R-transform is additive for free random variables, i.e.

$$
R_{A+B}(g) = R_A(g) + R_B(g),
\tag{11.51}
$$

whenever $A$, $B$ are free.
We let $\mathfrak{z}_A(g)$ be the inverse function of

$$
g_A(z) = \tau\big[(z - A)^{-1}\big],
\tag{11.52}
$$

whose power series is actually given by (11.49). Consider a fixed scalar $g$. By construction

$$
\tau(g\mathbf{1}) = g = g_A(\mathfrak{z}_A(g)) = \tau\big[(\mathfrak{z}_A(g) - A)^{-1}\big].
\tag{11.53}
$$


The arguments of $\tau(\cdot)$ on the left and on the right of the above equation have the same mean
but they are in general different, so let us define their difference as $gX_A$ via

$$
gX_A := (z_A - A)^{-1} - g\mathbf{1},
\tag{11.54}
$$

where $z_A := \mathfrak{z}_A(g)$. From its very definition, we have $\tau(X_A) = 0$.


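6 Freeness is only exact in the large N limit of random matrices.

We can invert Eq. (11.54) and find

$$
A - z_A = -\frac{1}{g}\,(1 + X_A)^{-1}.
\tag{11.55}
$$

Consider another variable $B$, free from $A$. For the same fixed $g$ we can find the scalar
$z_B := \mathfrak{z}_B(g)$ and define $X_B$ with $\tau(X_B) = 0$ as for $A$, to find

$$
B - z_B = -\frac{1}{g}\,(1 + X_B)^{-1}.
\tag{11.56}
$$

Since $X_A$ and $X_B$ are functions of $A$ and $B$, $X_A$ and $X_B$ are also free. Now,

$$
\begin{aligned}
A + B - z_A - z_B &= -\frac{1}{g}\,(1 + X_A)^{-1} - \frac{1}{g}\,(1 + X_B)^{-1} \\
&= -\frac{1}{g}\,(1 + X_A)^{-1}\,(2 + X_A + X_B)\,(1 + X_B)^{-1}.
\end{aligned}
\tag{11.57}
$$

Hence, noting that $(1 + X_A)(1 + X_B) + 1 - X_A X_B = 2 + X_A + X_B$,

$$
\begin{aligned}
A + B - z_A - z_B + \frac{1}{g} &= -\frac{1}{g}\,(1 + X_A)^{-1}\,(1 - X_A X_B)\,(1 + X_B)^{-1} \\
\Rightarrow \quad \left[A + B - \left(z_A + z_B - \frac{1}{g}\right)\right]^{-1} &= -g\,(1 + X_B)\,(1 - X_A X_B)^{-1}\,(1 + X_A).
\end{aligned}
\tag{11.58}
$$

Using the identity

$$
(1 - X_A X_B)^{-1} = \sum_{n=0}^{\infty} (X_A X_B)^n,
\tag{11.59}
$$

we can expand the expression

$$
\tau\big[(1 + X_B)\,(1 - X_A X_B)^{-1}\,(1 + X_A)\big],
\tag{11.60}
$$

which will contain 1 plus terms of the form $\tau(X_A X_B X_A X_B \cdots X_B)$ where the initial and
final factor might be either $X_A$ or $X_B$ but the important point is that $X_A$ and $X_B$ always
alternate. By the freeness and zero mean of $X_A$ and $X_B$, all these terms are thus zero. Hence
we get

$$
\tau\left\{\left[A + B - \left(z_A + z_B - \frac{1}{g}\right)\right]^{-1}\right\} = -g \quad \Rightarrow \quad g_{A+B}\big(z_A + z_B - g^{-1}\big) = g,
\tag{11.61}
$$

finally leading to the announced result:⁷

$$
\mathfrak{z}_{A+B} = z_A + z_B - g^{-1} \quad \Rightarrow \quad R_{A+B} = R_A + R_B.
\tag{11.62}
$$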


7 The above compact proof is taken from Tao [2012].


11.3.4 R-Transform and Cumulants


The R-transform is defined as a power series in $g$. We claim that the coefficients of this
power series are in fact exactly the non-commutative cumulants defined earlier. In other
words, $R_A(g)$ is the cumulant generating function:

$$
R_A(g) = \sum_{k=1}^{\infty} \kappa_k(A)\,g^{k-1}.
\tag{11.63}
$$

To show that these coefficients are indeed the cumulants we first realize that the general
equality $R(g) = \mathfrak{z}(g) - 1/g$ is equivalent to

$$
z\,g_A(z) - 1 = g_A(z)\,R_A(g_A(z)).
\tag{11.64}
$$

We can compute the power series of the two sides of this equality:

$$
z\,g_A(z) - 1 = \sum_{k=1}^{\infty} \frac{m_k}{z^k},
\tag{11.65}
$$

where $m_k := \tau(A^k)$ denotes the $k$th moment, and

$$
g_A(z)\,R_A(g_A(z)) = \sum_{k=1}^{\infty} \kappa_k \left(\frac{1}{z} + \sum_{\ell=1}^{\infty} \frac{m_\ell}{z^{\ell+1}}\right)^{k}.
\tag{11.66}
$$


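Equating the right hand sides of Eqs. (11.65) and (11.66) and matching powers of $1/z$ we
get recursive relations between moments ($m_k$) and cumulants ($\kappa_k$):

$$
\begin{aligned}
m_1 &= \kappa_1 &&\Rightarrow\; m_1 = \kappa_1, \\
m_2 &= \kappa_2 + \kappa_1 m_1 &&\Rightarrow\; m_2 = \kappa_2 + \kappa_1^2, \\
m_3 &= \kappa_3 + 2\kappa_2 m_1 + \kappa_1 m_2 &&\Rightarrow\; m_3 = \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3, \\
m_4 &= \kappa_4 + 3\kappa_3 m_1 + \kappa_2\big[2m_2 + m_1^2\big] + \kappa_1 m_3 &&\Rightarrow\; m_4 = \kappa_4 + 6\kappa_2\kappa_1^2 + 2\kappa_2^2 + 4\kappa_3\kappa_1 + \kappa_1^4.
\end{aligned}
\tag{11.67}
$$

By looking at the $z^{-k}$ term coming from the $[1/z + \cdots]^k$ term in Eq. (11.66) we realize
that $m_k = \kappa_k + \cdots$ where "$\cdots$" are homogeneous combinations of lower order $\kappa_\ell$ and $m_\ell$.
Since the coefficients of the power series Eq. (11.63) are additive under addition of free
variables and obey the property

$$
\kappa_k(A) = \tau(A^k) + \text{homogeneous products of lower order moments},
\tag{11.68}
$$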


they are therefore the cumulants defined in Section 11.3.2. 
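The matching of powers of $1/z$ described above can be automated. In the sketch below (function names `poly_mul` and `moments_from_free_cumulants` are hypothetical), we write $u = 1/z$ and $g = u\,M(u)$ with $M(u) = \sum_\ell m_\ell u^\ell$, so that the coefficient of $u^n$ in $\kappa_k g^k$ is the coefficient of $u^{n-k}$ in $M^k$, which only involves moments of order lower than $n$.

```python
from fractions import Fraction

def poly_mul(p, q, order):
    """Product of two power series (coefficient lists), truncated at degree `order`."""
    out = [Fraction(0)] * (order + 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j <= order:
                out[i + j] += a * b
    return out

def moments_from_free_cumulants(kappa):
    """kappa[k-1] = kappa_k; returns [m_1, ..., m_N] by matching powers of u = 1/z."""
    m = [Fraction(1)]                          # m_0 = 1
    for n in range(1, len(kappa) + 1):
        total, power = Fraction(0), [Fraction(1)]
        for k in range(1, n + 1):
            power = poly_mul(power, m, n)      # power = M(u)^k up to degree n
            total += kappa[k - 1] * power[n - k]
        m.append(total)
    return m[1:]

print(moments_from_free_cumulants([1, 2, 3, 4]))
print(moments_from_free_cumulants([0, 1, 0, 0, 0, 0]))
```

With $\kappa_1, \ldots, \kappa_4 = 1, 2, 3, 4$ the recursion gives $m_4 = 4 + 6 \cdot 2 + 2 \cdot 4 + 4 \cdot 3 + 1 = 37$, in agreement with the closed form for $m_4$; with only $\kappa_2 = 1$ non-zero, the even moments are $1, 2, 5$.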


11.3.5 Cumulants and Non-Crossing Partitions 


We saw that Eq. (11.64) can be used to compute cumulants iteratively. Actually that
equation can be translated into a systematic relation between moments and cumulants:

$$
m_n = \sum_{\pi \in NC(n)} \kappa_{\pi_1} \cdots \kappa_{\pi_{\ell_\pi}},
\tag{11.69}
$$

where $\pi \in NC(n)$ indicates that the sum is over all possible non-crossing partitions of $n$
elements. For any such partition $\pi$ the integers $\{\pi_1, \pi_2, \ldots, \pi_{\ell_\pi}\}$ ($1 \le \ell_\pi \le n$) equal the
number of elements in each group (see Fig. 11.3). They satisfy

$$
n = \sum_{k=1}^{\ell_\pi} \pi_k.
\tag{11.70}
$$

Figure 11.1 List of all non-crossing partitions of four elements. In Eq. (11.69) for $m_4$, the first
partition contributes $\kappa_1\kappa_1\kappa_1\kappa_1 = \kappa_1^4$. The next six all contribute $\kappa_2\kappa_1^2$ and so forth. We read
$m_4 = \kappa_1^4 + 6\kappa_2\kappa_1^2 + 2\kappa_2^2 + 4\kappa_3\kappa_1 + \kappa_4$.

Figure 11.2 A typical non-crossing diagram contributing to a large moment $m_n$. In this example
the first element is connected to four others (giving a factor of $\kappa_5$) which breaks the diagram into
five disjoint non-crossing diagrams contributing a factor $m_{\ell_1} m_{\ell_2} m_{\ell_3} m_{\ell_4} m_{\ell_5}$. Note that we must
have $\ell_1 + \ell_2 + \ell_3 + \ell_4 + \ell_5 = n - 5$.


We will show that, provided we define cumulants by Eq. (11.69), we recover
Eq. (11.64). But before we do so, let us first show this relation on a simple example.
Figure 11.1 shows the computation of the fourth moment in terms of the cumulants.

The argument is very similar to the recursion relation obtained for Catalan numbers
where we considered non-crossing pair partitions (see Section 3.2.3). Here the argument
is slightly more complicated as we have partitions of all possible sizes. We consider the
moment $m_n$ for $n \ge 1$. We break down the sum over all non-crossing partitions of $n$
elements by looking at $\ell$, the size of the set containing the first element (for example in
Fig. 11.3, the first element belongs to a set of size $\ell = 5$). The size of this first set can
be $1 \le \ell \le n$. This initial set breaks the partition into $\ell$ (possibly empty) disjoint smaller
partitions. They must be disjoint, otherwise there would be a crossing. In Figure 11.2 we
show how an initial 5-set breaks the full partition into 5 blocks. In each of these blocks,
every non-crossing partition is possible; the only constraint is that the total size of the
partition must be $n$. The sum over all possible non-crossing partitions of size $k$ of the
relevant $\kappa$'s is the moment $m_k$. Note that the empty partition contributes a multiplicative
factor 1, so we define $m_0 = 1$. Putting everything together we obtain the following
recursion relation for $m_n$:

$$
m_n = \sum_{\ell=1}^{n} \kappa_\ell \sum_{\substack{k_1, k_2, \ldots, k_\ell \ge 0 \\ k_1 + k_2 + \cdots + k_\ell = n - \ell}} m_{k_1} m_{k_2} \cdots m_{k_\ell}.
\tag{11.71}
$$


Let us multiply both sides of this equation by $z^{-n}$ and sum over $n$ from 1 to $\infty$. The left
hand side gives $z\,g(z) - 1$, by definition of $g(z)$. The right hand side reads

$$
\sum_{n=1}^{\infty} \sum_{\ell=1}^{n} \kappa_\ell \sum_{k_1, k_2, \ldots, k_\ell = 0}^{\infty} \delta_{k_1 + k_2 + \cdots + k_\ell,\, n - \ell}\; \frac{m_{k_1}}{z^{1+k_1}}\,\frac{m_{k_2}}{z^{1+k_2}} \cdots \frac{m_{k_\ell}}{z^{1+k_\ell}},
\tag{11.72}
$$

which can be transformed into

$$
\sum_{\ell=1}^{\infty} \kappa_\ell \left(\sum_{k=0}^{\infty} \frac{m_k}{z^{k+1}}\right)^{\ell} = g(z)\,R(g(z)),
\tag{11.73}
$$

where we have used Eq. (11.63). We thus recover exactly Eq. (11.64), showing that the
relation (11.69) is equivalent to our previous definition of the free cumulants.

Figure 11.3 Generic non-crossing partition of 23 elements with two singletons, five doublets, two
triplets, and one quintet, such that $23 = 5 + 2 \cdot 3 + 5 \cdot 2 + 2 \cdot 1$. In Eq. (11.69), this particular partition
appears for $m_{23}$ and contributes $\kappa_5 \kappa_3^2 \kappa_2^5 \kappa_1^2$.

Figure 11.4 List of all non-crossing partitions of three elements. From this we get Eq. (11.74) for
three elements: $\tau(A_1 A_2 A_3) = \kappa_1(A_1)\kappa_1(A_2)\kappa_1(A_3) + \kappa_2(A_1, A_2)\kappa_1(A_3) + \kappa_2(A_1, A_3)\kappa_1(A_2) +
\kappa_2(A_2, A_3)\kappa_1(A_1) + \kappa_3(A_1, A_2, A_3)$.

It is interesting to contrast the moment—cumulant relation in the standard (commuta- 
tive) case (Eq. (11.23)) and the free (non-commutative) case (Eq. (11.69)). Both can be 
written as a sum over all partitions on n elements; in the standard case all partitions are 
allowed, while in the free case the sum is only over non-crossing partitions. 
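The contrast between the two moment-cumulant relations can be seen by brute-force enumeration for small $n$. The sketch below is an illustration only; the helper names (`set_partitions`, `is_non_crossing`, `moment`) are hypothetical. It generates every partition of $n$ elements, optionally keeps only the non-crossing ones, and sums the products of cumulants over blocks.

```python
from itertools import combinations
from math import prod

def set_partitions(elements):
    """Yield every partition of the list `elements` as a list of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # put `first` into each existing block in turn, or into a new block
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition

def is_non_crossing(partition):
    """False if some a < b < c < d has a, c in one block and b, d in another."""
    for block1, block2 in combinations(partition, 2):
        for a, c in combinations(block1, 2):      # blocks are sorted, so a < c
            for b, d in combinations(block2, 2):  # and b < d
                if a < b < c < d or b < a < d < c:
                    return False
    return True

def moment(n, kappa, free):
    """m_n as a sum over partitions (non-crossing only if free) of products of kappa."""
    total = 0
    for partition in set_partitions(list(range(n))):
        if free and not is_non_crossing(partition):
            continue
        total += prod(kappa[len(block) - 1] for block in partition)
    return total

kappa = [1, 2, 3, 4]          # kappa_1, ..., kappa_4 (arbitrary test values)
print(moment(4, kappa, free=True), moment(4, kappa, free=False))
```

For four elements there are 15 partitions, of which 14 are non-crossing; the single crossing partition $\{1,3\},\{2,4\}$ contributes $\kappa_2^2$, which is why the coefficient of $\kappa_2^2$ in $m_4$ is 3 in the commutative case and 2 in the free case.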


11.3.6 Freeness as the Vanishing of Mixed Cumulants 


We have defined freeness in Section 11.3.1 as the property of two variables $A$ and
$B$ such that the trace of any mixed combination of traceless polynomials in $A$ and in $B$
vanishes. There exists another equivalent definition of freeness, namely that every mixed
cumulant of $A$ and $B$ vanishes. To make sense of this definition we first need to introduce
cumulants of several variables. They are defined recursively by

$$
\tau(A_1 A_2 \cdots A_n) = \sum_{\pi \in NC(n)} \kappa_\pi(A_1, A_2, \ldots, A_n),
\tag{11.74}
$$

where the $A_i$'s are not necessarily distinct and $NC(n)$ is the set of all non-crossing partitions
of $n$ elements. Here

$$
\kappa_\pi(A_1, A_2, \ldots, A_n) = \kappa_{\pi_1}(\ldots) \cdots \kappa_{\pi_\ell}(\ldots)
\tag{11.75}
$$


are the products of cumulants of variables belonging to the same group of the correspond- 
ing partition — see Figure 11.4 for an illustration. We also call these generalized cumulants 
the free cumulants. 

When all the variables in Eq. (11.74) are the same ($A_i = A$) we recover the previous
definition of cumulants with a slightly different notation (e.g. $\kappa_3(A, A, A) = \kappa_3(A)$).
Cumulants with more than one variable are called mixed cumulants (e.g. $\kappa_4(A, A, B, A)$).
By applying Eq. (11.74) we find for the low generalized cumulants of two variables

$$
\begin{aligned}
m_1(A) &= \kappa_1(A), \\
m_2(A, B) &= \kappa_1(A)\kappa_1(B) + \kappa_2(A, B), \\
m_3(A, A, B) &= \kappa_1(A)^2\kappa_1(B) + \kappa_2(A, A)\kappa_1(B) + 2\kappa_2(A, B)\kappa_1(A) + \kappa_3(A, A, B).
\end{aligned}
\tag{11.76}
$$

We can now state more precisely the alternative definition of freeness: a set of variables
is free if and only if all their mixed cumulants vanish. For example, in the low cumulants
listed above, freeness of $A$ and $B$ implies that $\kappa_2(A, B) = \kappa_3(A, A, B) = 0$.

This definition of freeness is easy to generalize to a collection of variables, i.e. a 
collection of variables is free if all its mixed cumulants are zero. As noted at the end 
of Section 11.3.7 below, pairwise freeness is not enough to ensure that a collection is free. 

We remark that vanishing of mixed cumulants implies that free cumulants are additive.
In Speicher's notation, $\kappa_k(A, B, C, \ldots)$ is a multi-linear function in each of its arguments,
where $k$ gives the number of variables. Thus we have

$$
\begin{aligned}
\kappa_k(A + B, A + B, \ldots) &= \kappa_k(A, A, \ldots) + \kappa_k(B, B, \ldots) + \text{mixed cumulants} \\
&= \kappa_k(A, A, \ldots) + \kappa_k(B, B, \ldots),
\end{aligned}
\tag{11.77}
$$

i.e. $\kappa_k$ is additive.

We will give a concrete application of the formalism of free mixed cumulants in 

Section 12.2. 


11.3.7 The Central Limit Theorem for Free Variables 


We can now go back and re-read Section 11.2.4. We can replace every occurrence of the
word independent with free, and cumulant with free cumulant. The LLN now states that the
sum of $K$ free identically distributed (FID) variables normalized by $K^{-1}$ converges to a
constant (also called a scalar) with the same mean.

Let us define a free Wigner variable as a variable with second free cumulant $\kappa_2 > 0$ and
all other free cumulants equal to zero. In other words, a free Wigner variable is such that
$R_W(x) = \kappa_2 x$. The CLT then states that the sum of $K$ zero-mean free identical variables
normalized by $K^{-1/2}$ converges to a free Wigner variable with the same second cumulant.

In the case where our free random variables are large symmetric random matrices,
the Wigner defined here by its cumulant coincides with the Wigner matrices defined in
Chapter 2. We indeed saw that the R-transform of a Wigner matrix is given by $R(x) = \sigma^2 x$,
i.e. the cumulant generating function has a single term corresponding to $\kappa_2 = \sigma^2$.

Alternatively, we note that the moments of a Wigner are given by the sum over non- 
crossing pair partitions (Eq. (3.26)). Comparing with Eq. (11.69), we realize that partitions 
containing anything other than pairs must contribute zero, hence only the second cumulant 
of the Wigner is non-zero. 
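As a small check of this statement, the recursion (11.71) can be run with a single non-zero free cumulant. In the sketch below (the function name `wigner_moments` is hypothetical), only the $\ell = 2$ term of the recursion survives, and the even moments come out as Catalan numbers times powers of $\kappa_2$.

```python
from math import comb

def wigner_moments(sigma2, n_max):
    """Moments m_0..m_{n_max} when kappa_2 = sigma2 is the only non-zero free cumulant."""
    m = [1]                                   # m_0 = 1
    for n in range(1, n_max + 1):
        # Eq. (11.71) restricted to l = 2: m_n = kappa_2 * sum_{k1 + k2 = n - 2} m_k1 m_k2
        m.append(sigma2 * sum(m[k] * m[n - 2 - k] for k in range(n - 1)))
    return m

def catalan(k):
    """k-th Catalan number, C(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

print(wigner_moments(1, 8))
```

For unit variance the even moments $m_2, m_4, m_6, m_8$ are $1, 2, 5, 14$ and all odd moments vanish.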


The LLN and the CLT require variables to be collectively free, in the sense that all mixed
cumulants are zero. As is the case with independence, pairwise freeness is not enough to
ensure freeness as a collection (see footnote 5 in Section 11.2.4). Indeed, in Section 12.5 we will
encounter variables that are pairwise free but not free as a collection. One can have $A$
and $B$ mutually free and both free with respect to $C$ but $A + B$ is not free with respect
to $C$. This does not happen for rotationally invariant large matrices but can arise in other
constructions. The definition of a free collection is just an extension of definition (11.40)
including traceless polynomials in all variables in the collection. With this definition, sums
of variables in the collection are free from those not included in the sum (e.g. $A + B$ is
free from $C$).


11.3.8 Subordination Relation for Addition of Free Variables 


We now introduce the subordination relation for free addition, which is just a rewriting of
the addition of R-transforms. For free $A$ and $B$, we have

$$
R_A(g) + R_B(g) = R_{A+B}(g) \quad \Rightarrow \quad \mathfrak{z}_A(g) + R_B(g) = \mathfrak{z}_{A+B}(g),
\tag{11.78}
$$

where

$$
g_{A+B}(\mathfrak{z}_{A+B}) = g = g_A(\mathfrak{z}_A) = g_A\big(\mathfrak{z}_{A+B} - R_B(g)\big).
\tag{11.79}
$$

We call $z := \mathfrak{z}_{A+B}(g)$, then the above relations give

$$
g_{A+B}(z) = g_A\big(z - R_B(g_{A+B}(z))\big),
\tag{11.80}
$$

which is called a subordination relation (compare with Eq. (10.2)).
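The subordination relation can be checked numerically in the simplest case of two free Wigner variables. The sketch below assumes the standard closed form of the Wigner Stieltjes transform, $g(z) = (z - \sqrt{z^2 - 4\sigma^2})/(2\sigma^2)$ for real $z > 2\sigma$ (recalled here as an assumption, not derived in this section), together with $R(g) = \sigma^2 g$ and the fact that the free sum of two Wigner variables is a Wigner variable whose variance is the sum of the two variances. The function names are hypothetical.

```python
from math import sqrt

def g_wigner(z, sigma2):
    """Stieltjes transform of a Wigner variable of variance sigma2, for real z > 2*sigma."""
    return (z - sqrt(z * z - 4.0 * sigma2)) / (2.0 * sigma2)

def r_wigner(g, sigma2):
    """R-transform of a Wigner variable: R(g) = sigma2 * g."""
    return sigma2 * g

sigma2_a, sigma2_b, z = 1.0, 2.0, 5.0
g_sum = g_wigner(z, sigma2_a + sigma2_b)                    # g_{A+B}(z), variances add
g_sub = g_wigner(z - r_wigner(g_sum, sigma2_b), sigma2_a)   # g_A(z - R_B(g_{A+B}(z)))
print(g_sum, g_sub)
```

The two printed values should agree up to rounding, since $g = g_{A+B}(z)$ solves $(\sigma_A^2 + \sigma_B^2) g^2 - z g + 1 = 0$, which is the same as $\sigma_A^2 g^2 - (z - \sigma_B^2 g) g + 1 = 0$.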


11.4 Free Product 


In the previous section, we have studied the property of the sum of free random variables. In 
the case of commuting variables, the question of studying the product of (positive) random 
variables is trivial, since taking the logarithm of this product we are back to the problem of 
sums again. In the case of non-commuting variables, things are more interesting. We will 
see below that one needs to introduce the so-called S-transform, which is the counterpart 
of the R-transform for products of free variables. 

We start by noticing that the free product of traceless variables is trivial. If $A$, $B$ are free
and $\tau(A) = \tau(B) = 0$, we have

$$
\tau\big((AB)^k\big) = \tau(ABAB \cdots AB) = 0.
\tag{11.81}
$$


11.4.1 Low Moments of Free Products 


We will now compute the first few moments of the free products of two variables with a
non-zero trace: $C := AB$ where $A$, $B$ are free and $\tau(A) \neq 0$, $\tau(B) \neq 0$. Without loss of
generality, we can assume that $\tau(A) = \tau(B) = 1$ by rescaling $A$ and $B$. Then

$$
\tau(C) = \tau\big((A - \tau(A))(B - \tau(B))\big) + \tau(A)\tau(B) = \tau(A)\tau(B) = 1.
\tag{11.82}
$$

We can also use (11.74) to get

$$
\tau(C) = \kappa_2(A, B) + \kappa_1(A)\kappa_1(B) = \kappa_1(A)\kappa_1(B) = 1,
\tag{11.83}
$$

since mixed cumulants are zero for mutually free variables. Similarly, using Eq. (11.74) we
can get that (see Fig. 11.1 for the non-crossing partitions of four elements)

$$
\begin{aligned}
\tau(C^2) = \tau(ABAB) &= \kappa_1(A)^2\kappa_1(B)^2 + \kappa_2(A)\kappa_1(B)^2 + \kappa_1(A)^2\kappa_2(B) \\
&= 1 + \kappa_2(A) + \kappa_2(B),
\end{aligned}
\tag{11.84}
$$

which gives

$$
\kappa_2(C) := \tau(C^2) - \tau(C)^2 = \kappa_2(A) + \kappa_2(B).
\tag{11.85}
$$

For the third moment of $C = AB$, we can follow Figure 11.5 and get

$$
\begin{aligned}
\tau(C^3) &= \tau(ABABAB) \\
&= 1 + 3\kappa_2(A) + 3\kappa_2(B) + 3\kappa_2(A)\kappa_2(B) + \kappa_3(A) + \kappa_3(B),
\end{aligned}
\tag{11.86}
$$

leading to

$$
\begin{aligned}
\kappa_3(C) &:= \tau(C^3) - 3\tau(C^2)\tau(C) + 2\tau(C)^3 \\
&= \kappa_3(A) + \kappa_3(B) + 3\kappa_2(A)\kappa_2(B).
\end{aligned}
\tag{11.87}
$$

Figure 11.5 List of all non-crossing partitions of six elements contributing to $\tau(ABABAB)$
excluding mixed $AB$ terms. Terms involving $A$ are in thick gray and $B$ in black. Equation (11.86)
can be read off from these diagrams. Note that $\kappa_1(A) = \kappa_1(B) = 1$.


Under free products of unit-trace variables, the mean remains equal to one and the 
variance is additive. The third cumulant is not additive; it is strictly greater than the sum of 
the third cumulants unless one of the two variables is the identity (unit scalar). 
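The low moments of the free product can be recovered by brute force from Eq. (11.74): enumerate all partitions of the $2k$ positions of the word $ABAB\cdots AB$, discard those with a crossing or with a block mixing $A$ and $B$ positions (mixed cumulants vanish), and sum the products of cumulants. This is a minimal sketch with hypothetical helper names and arbitrary test values for the cumulants.

```python
from itertools import combinations
from math import prod

def set_partitions(elements):
    """Yield every partition of the list `elements` as a list of sorted blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition

def crosses(block1, block2):
    """True if a < b < c < d with a, c in one block and b, d in the other."""
    return any(a < b < c < d or b < a < d < c
               for a, c in combinations(block1, 2)
               for b, d in combinations(block2, 2))

def product_moment(k, kappa_a, kappa_b):
    """tau((AB)^k) for free A, B; kappa_x[j-1] is the j-th free cumulant of X."""
    total = 0
    for partition in set_partitions(list(range(2 * k))):
        if any(len({i % 2 for i in block}) > 1 for block in partition):
            continue                  # block mixes A (even) and B (odd) positions
        if any(crosses(b1, b2) for b1, b2 in combinations(partition, 2)):
            continue
        total += prod((kappa_a if block[0] % 2 == 0 else kappa_b)[len(block) - 1]
                      for block in partition)
    return total

kappa_a, kappa_b = [1, 2, 5], [1, 3, 7]   # unit trace, arbitrary kappa_2 and kappa_3
print([product_moment(k, kappa_a, kappa_b) for k in (1, 2, 3)])
```

With these test values, Eqs. (11.84) and (11.86) predict $\tau(C^2) = 1 + 2 + 3 = 6$ and $\tau(C^3) = 1 + 6 + 9 + 18 + 5 + 7 = 46$.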


11.4.2 Definition of the S-Transform 


We will now show that the above relations can be encoded into the S-transform $S(t)$ which
is multiplicative for products of free variables:

$$
S_{AB}(t) = S_A(t)\,S_B(t)
\tag{11.88}
$$

for $A$ and $B$ free. To define the S-transform, we first introduce the T-transform as

$$
\begin{aligned}
t_A(\zeta) &= \tau\big[(1 - \zeta^{-1}A)^{-1}\big] - 1 \\
&= \zeta\,g_A(\zeta) - 1 \\
&= \sum_{k=1}^{\infty} \frac{m_k}{\zeta^k}.
\end{aligned}
\tag{11.89}
$$

The behavior at infinity of the T-transform depends explicitly on $m_1$, the first moment of $A$
($t(\zeta) \sim m_1/\zeta$), unlike the Stieltjes transform which always behaves as $1/z$.

The T-transform has the same singularities as the Stieltjes transform except maybe at
zero. When $A$ is a matrix, one can recover the continuous part of its eigenvalue density
$\rho(x)$ using the following T-version of the Sokhotski-Plemelj formula:

$$
\lim_{\eta \to 0^+} \operatorname{Im}\, t(x - i\eta) = \pi x \rho(x).
\tag{11.90}
$$

Poles in the T-transform indicate Dirac masses. If $t(\zeta) \sim A/(\zeta - \lambda_0)$ near $\lambda_0$ then the
density is a Dirac mass of amplitude $A/\lambda_0$ at $\lambda_0$. The behavior at zero of the T-transform is
a bit different from that of the Stieltjes transform. A regular density at zero gives a regular
Stieltjes and hence $t(0) = -1$. Deviations from this value indicate a Dirac mass at zero,
hence when $t(0) \neq -1$, the density has a Dirac at zero of amplitude $t(0) + 1$.

The T-transform can also be written as

$$
t_A(\zeta) = \tau\big[A\,(\zeta - A)^{-1}\big].
\tag{11.91}
$$

We define $\zeta_A(t)$ to be the inverse function of $t_A(\zeta)$. When $m_1 \neq 0$, $t_A$ is invertible for large
$\zeta$, and hence $\zeta_A$ exists for small enough $t$. We then define the S-transform as⁸

$$
S_A(t) := \frac{t + 1}{t\,\zeta_A(t)},
\tag{11.92}
$$

for variables $A$ such that $\tau(A) \neq 0$.
Let us compute the S-transform of the identity $S_{\mathbf{1}}(t)$:

$$
t_{\mathbf{1}}(\zeta) = \frac{1}{\zeta - 1} \quad \Rightarrow \quad \zeta_{\mathbf{1}}(t) = \frac{t + 1}{t} \quad \Rightarrow \quad S_{\mathbf{1}}(t) = 1,
\tag{11.93}
$$

as expected as the identity is free with respect to any variable. The S-transform scales in a
simple way with the variable $A$. To find its scaling we first note that

$$
t_{\alpha A}(\zeta) = \tau\big[(1 - (\alpha^{-1}\zeta)^{-1}A)^{-1}\big] - 1 = t_A(\zeta/\alpha),
\tag{11.94}
$$

which gives

$$
\zeta_{\alpha A}(t) = \alpha\,\zeta_A(t).
\tag{11.95}
$$

Then, using (11.92), we get that

$$
S_{\alpha A}(t) = \alpha^{-1} S_A(t).
\tag{11.96}
$$

The above scaling is slightly counterintuitive but it is consistent with the fact that
$S_A(0) = 1/\tau(A)$. We will be focusing on unit trace objects such that $S(0) = 1$.
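These properties can be checked numerically on a toy spectrum. The sketch below assumes a variable with a finite set of positive eigenvalues and given weights (an illustrative choice, not from the text), evaluates $t_A(\zeta) = \tau[A(\zeta - A)^{-1}]$ directly, and inverts it by bisection for small $t > 0$, using the fact that $t_A$ decreases from $+\infty$ to 0 on $(\lambda_{\max}, \infty)$. Function names are hypothetical.

```python
def t_transform(zeta, eigenvalues, weights):
    """t_A(zeta) = tau[A (zeta - A)^(-1)] for a discrete spectrum."""
    return sum(w * lam / (zeta - lam) for lam, w in zip(eigenvalues, weights))

def s_transform(t, eigenvalues, weights):
    """S_A(t) = (t + 1) / (t * zeta_A(t)) for small t > 0 and positive eigenvalues."""
    lam_max = max(eigenvalues)
    mean = sum(w * lam for lam, w in zip(eigenvalues, weights))
    # bracket: t_A -> +infinity just above lam_max, and t_A(lam_max + mean/t) <= t
    lo, hi = lam_max, lam_max + mean / t
    for _ in range(200):                     # bisection on the decreasing function t_A
        mid = 0.5 * (lo + hi)
        if t_transform(mid, eigenvalues, weights) > t:
            lo = mid
        else:
            hi = mid
    return (t + 1.0) / (t * 0.5 * (lo + hi))

weights = [0.5, 0.5]
s_a = s_transform(1e-3, [1.0, 3.0], weights)     # tau(A) = 2
s_3a = s_transform(1e-3, [3.0, 9.0], weights)    # same variable rescaled by alpha = 3
print(s_a, 3.0 * s_3a)
```

For small $t$ the first value is close to $1/\tau(A) = 1/2$, and the rescaled variable satisfies $3\,S_{3A}(t) = S_A(t)$ up to rounding, as expected from Eq. (11.96).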


8 Most authors prefer to define the S-transform in terms of the moment generating function $\psi(z) := t(1/z)$. The definition
$S(t) = \psi^{-1}(t)(t + 1)/t$ is equivalent to ours ($\psi^{-1}(t)$ is the functional inverse of $\psi(z)$). We prefer to work with the
T-transform as the function $t(\zeta)$ has an analytic structure very similar to that of $g(z)$. The function $\psi(z)$ is analytic near zero
and singular for large values of $z$ corresponding to the reciprocal of the eigenvalues.

The construction of the S-transform relies on the properties of mixed moments of free
variables. In that respect it is closely related to the R-transform. Using the relation $t_A(\zeta) =
\zeta g_A(\zeta) - 1$, one can get the following relationships between $R_A$ and $S_A$:

$$
S_A(t) = \frac{1}{R_A(t\,S_A(t))}, \qquad R_A(g) = \frac{1}{S_A(g\,R_A(g))}.
\tag{11.97}
$$


11.4.3 Multiplicativity of the S-Transform 


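We can now show the multiplicative property (11.88). The proof is similar to the one
given for the additive case and is adapted from it.

We fix $t$ and let $\zeta_A$ and $\zeta_B$ be the inverse T-transforms of $t_A$ and $t_B$ evaluated at $t$. Then we define
$E_A$ through

$$
1 + t + E_A = (1 - A/\zeta_A)^{-1},
\tag{11.98}
$$

and similarly for $E_B$. We have $\tau(E_A) = 0$, $\tau(E_B) = 0$, and, since $A$ and $B$ are free,
$E_A$, $E_B$ are also free. Then we have

$$
\frac{A}{\zeta_A} = 1 - (1 + t + E_A)^{-1},
\tag{11.99}
$$

which gives

$$
\begin{aligned}
\frac{AB}{\zeta_A \zeta_B} &= \big[1 - (1 + t + E_A)^{-1}\big]\big[1 - (1 + t + E_B)^{-1}\big] \\
&= (1 + t + E_A)^{-1}\,\big[(t + E_A)(t + E_B)\big]\,(1 + t + E_B)^{-1}.
\end{aligned}
\tag{11.100}
$$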
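Using the identity

$$
(t + E_A)(t + E_B) = \frac{t}{1 + t}\left[(1 + t + E_A)(1 + t + E_B) - (1 + t) + \frac{1}{t}\,E_A E_B\right],
\tag{11.101}
$$

we can rewrite the above expression as

$$
\begin{aligned}
\frac{AB}{\zeta_A \zeta_B} &= \frac{t}{1 + t}\left[1 - (1 + t)\,(1 + t + E_A)^{-1}\left(1 - \frac{E_A E_B}{t(1 + t)}\right)(1 + t + E_B)^{-1}\right] \\
\Rightarrow \quad 1 - \frac{1 + t}{t}\,\frac{AB}{\zeta_A \zeta_B} &= (1 + t)\,(1 + t + E_A)^{-1}\left(1 - \frac{E_A E_B}{t(1 + t)}\right)(1 + t + E_B)^{-1} \\
\Rightarrow \quad \left[1 - \frac{1 + t}{t}\,\frac{AB}{\zeta_A \zeta_B}\right]^{-1} &= \frac{1}{1 + t}\,(1 + t + E_B)\left[1 - \frac{E_A E_B}{t(1 + t)}\right]^{-1}(1 + t + E_A).
\end{aligned}
\tag{11.102}
$$

Using the expansion

$$
\left[1 - \frac{E_A E_B}{t(1 + t)}\right]^{-1} = \sum_{n=0}^{\infty} \left(\frac{E_A E_B}{t(1 + t)}\right)^{n},
\tag{11.103}
$$

one can check that

$$
\tau\left[(1 + t + E_B)\left[1 - \frac{E_A E_B}{t(1 + t)}\right]^{-1}(1 + t + E_A)\right] = (1 + t)^2,
\tag{11.104}
$$

where we used the freeness condition for $E_A$ and $E_B$. Thus we get that

$$
\tau\left\{\left[1 - \frac{1 + t}{t}\,\frac{AB}{\zeta_A \zeta_B}\right]^{-1}\right\} = 1 + t \quad \Rightarrow \quad t_{AB}\!\left(\frac{t\,\zeta_A \zeta_B}{1 + t}\right) = t,
\tag{11.105}
$$

which gives that

$$
S_{AB}(t) = S_A(t)\,S_B(t)
\tag{11.106}
$$

thanks to the definition (11.92).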


11.4.4 Subordination Relation for the Free Product 


We next derive a subordination relation for the free product using (11.88) and (11.92):

$$ S_{AB}(t) = S_A(t)\,S_B(t) \;\Rightarrow\; \zeta_{AB}(t) = \frac{\zeta_A(t)}{S_B(t)}, \qquad (11.107) $$

where

$$ t_{AB}(\zeta_{AB}(t)) = t = t_A(\zeta_A(t)) = t_A(\zeta_{AB}(t)\,S_B(t)). \qquad (11.108) $$

We call $\zeta := \zeta_{AB}(t)$, then the above relations give

$$ t_{AB}(\zeta) = t_A\bigl(\zeta\, S_B(t_{AB}(\zeta))\bigr), \qquad (11.109) $$

which is the subordination relation for the free product. In fact, the above is true even when
$S_A$ does not exist, e.g. when $\tau(A) = 0$.

When applied to free random matrices, the form AB is not very useful since it is not
necessarily symmetric even if A and B are. But if $A \succeq 0$ (i.e. A is positive semi-definite
symmetric) and B is symmetric, then $A^{1/2} B A^{1/2}$ has the same moments as AB and is also
symmetric. In our applications below we will always encounter the case $A \succeq 0$ and call
$A^{1/2} B A^{1/2}$ the free product of A and B.
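The claim that $A^{1/2} B A^{1/2}$ and $AB$ share the same moments follows from the cyclicity of the trace. A minimal numerical sketch, assuming NumPy and arbitrary small test matrices (the sizes and the seed are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6

# A: symmetric positive definite, B: symmetric (both arbitrary test matrices).
X = rng.standard_normal((N, 2 * N))
A = X @ X.T / (2 * N)
Y = rng.standard_normal((N, N))
B = (Y + Y.T) / 2

# Symmetric square root of A from its eigendecomposition.
w, V = np.linalg.eigh(A)
sqrtA = (V * np.sqrt(w)) @ V.T

free_product = sqrtA @ B @ sqrtA   # symmetric by construction
AB = A @ B                         # not symmetric in general

def tau(M):
    """Normalized trace (1/N) Tr M."""
    return np.trace(M) / M.shape[0]

moments_sym = [tau(np.linalg.matrix_power(free_product, k)) for k in range(1, 5)]
moments_AB = [tau(np.linalg.matrix_power(AB, k)) for k in range(1, 5)]
```

Both lists agree to rounding error, while only `free_product` is a symmetric matrix.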


Exercise 11.4.1 Properties of the S-transform
(a) Using Eq. (11.92), show that

$$ R(x) = \frac{1}{S(x R(x))}. \qquad (11.110) $$

Hint: define $t = x R(x) = zg - 1$ and identify $x$ as $g$.
(b) For a variable such that $\tau(M) = \kappa_1 = 1$, write $S(t)$ as a power series in $t$,
compute the first few terms of the power series, up to (and including) the $t^2$
term, using Eq. (11.110) and Eq. (11.63). You should find

$$ S(t) = 1 - \kappa_2 t + (2\kappa_2^2 - \kappa_3)\, t^2 + O(t^3). \qquad (11.111) $$

(c) We have shown that, when A and B are mutually free with unit trace,

$$ \tau(AB) = 1, \qquad (11.112) $$
$$ \tau(ABAB) - 1 = \kappa_2(A) + \kappa_2(B), \qquad (11.113) $$
$$ \tau(ABABAB) = \kappa_3(A) + \kappa_3(B) + 3\kappa_2(A)\kappa_2(B) + 3(\kappa_2(A) + \kappa_2(B)) + 1. \qquad (11.114) $$

Show that these relations are compatible with $S_{AB}(t) = S_A(t) S_B(t)$ and the
first few terms of your power series in (b).

(d) Consider $M_1 = 1 + \sigma_1 W_1$ and $M_2 = 1 + \sigma_2 W_2$ where $W_1$ and $W_2$ are
two different (free) unit Wigner matrices and both $\sigma$'s are less than 1/2. $M_1$
and $M_2$ have $\kappa_3 = 0$ and are positive definite in the large N limit. What is
$\kappa_3(M_1 M_2)$?

Exercise 11.4.2 S-transform of the matrix inverse

(a) Consider M an invertible symmetric random matrix and $M^{-1}$ its inverse.
Using Eq. (11.89), show that

$$ t_M(\zeta) + t_{M^{-1}}\!\left(\frac{1}{\zeta}\right) + 1 = 0. \qquad (11.115) $$

(b) Using Eq. (11.115), show that

$$ S_{M^{-1}}(t) = \frac{1}{S_M(-t-1)}. \qquad (11.116) $$

Hint: write $u(x) = 1/\zeta(t)$ where $u(x)$ is such that $x = t_{M^{-1}}(u(x))$. Equation
(11.115) is then equivalent to $x = -1 - t$.
In the last chapter, we introduced the concept of freeness rather abstractly, as the proper
non-commutative generalization of independence for usual random variables. In the present 
chapter, we explain why large, randomly rotated matrices behave as free random variables. 
This justifies the use of R-transforms and S-transforms to deal with the spectrum of sums 
and products of large random matrices. We also revisit the abstract central limit theorem 
of the previous chapter (Section 11.2.4) in the more concrete case of sums of randomly 
rotated matrices. 


12.1 Random Rotations and Freeness 
12.1.1 Statement of the Main Result 


Recall the definition of freeness. A and B are free if for any set of traceless polynomials
$p_1, \dots, p_n$ and $q_1, \dots, q_n$ the following equality holds:

$$ \tau\bigl(p_1(A) q_1(B) p_2(A) q_2(B) \cdots p_n(A) q_n(B)\bigr) = 0. \qquad (12.1) $$

In order to make the link with large matrices we will consider A and B to be large symmetric
matrices and $\tau(M) := \frac{1}{N}\mathrm{Tr}(M)$. The matrix A can be diagonalized as $U \Lambda U^T$ and B as
$V \Lambda' V^T$. A traceless polynomial $p_i(A)$ can be diagonalized as $U \Lambda_i U^T$, where U is the same
orthogonal matrix as for A itself and $\Lambda_i = p_i(\Lambda)$ is some traceless diagonal matrix, and
similarly for $q_i(B)$. Equation (12.1) then becomes

$$ \tau\bigl(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T \cdots \Lambda_n O \Lambda'_n O^T\bigr) = 0, \qquad (12.2) $$

where we have introduced $O = U^T V$ as the orthogonal matrix of basis change rotating the
eigenvectors of A into those of B.

As we argue below, in the large N limit Eq. (12.2) always holds true when averaged over
the orthogonal matrix O and whenever matrices $\Lambda_i$ and $\Lambda'_i$ are traceless. We also expect
that in the large N limit Eq. (12.2) becomes self-averaging, so a single matrix O behaves as
the average over all such matrices. Hence, two large symmetric matrices whose eigenbases
are randomly rotated with respect to one another are essentially free. For example, Wigner
matrices X and white Wishart matrices W are rotationally invariant, meaning that the
matrices of their eigenvectors are random orthogonal matrices. We conclude that for N
large, both X and W are free with respect to any matrix independent from them, in particular
they are free from any deterministic matrix.
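The self-averaging statement can be probed numerically: for a single Haar-distributed rotation, the alternating traces of Eq. (12.2) should already be close to zero at moderate N. The sketch below assumes NumPy; the helper names, the spectra and the value of N are illustrative choices, and Haar sampling uses the standard QR construction with a sign correction.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))   # fix column signs so the law is Haar

def tau(m):
    """Normalized trace (1/N) Tr M."""
    return np.trace(m) / m.shape[0]

def traceless_diag(values):
    """Diagonal matrix whose entries have been centered to zero mean."""
    return np.diag(values - values.mean())

rng = np.random.default_rng(1)
N = 400
O = haar_orthogonal(N, rng)

L1 = traceless_diag(rng.uniform(-1, 1, N))
L2 = traceless_diag(rng.uniform(-1, 1, N) ** 3)
Lp1 = traceless_diag(rng.choice([-1.0, 1.0], N))
Lp2 = traceless_diag(rng.exponential(1.0, N))

# Rotated primed matrices O Lambda' O^T.
B1, B2 = O @ Lp1 @ O.T, O @ Lp2 @ O.T

mixed_n2 = tau(L1 @ B1 @ L2 @ B2)              # Eq. (12.2) with n = 2
mixed_n3 = tau(L1 @ B1 @ L2 @ B2 @ L1 @ B2)    # Eq. (12.2) with n = 3
reference = tau(L1 @ L1) * tau(Lp1 @ Lp1)      # a typical non-vanishing scale
```

Both mixed traces come out much smaller than `reference`, and they shrink further as N grows.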


12.1.2 Integration over the Orthogonal Group 


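We now come back to the central statement that in the large N limit the average over O
of Eq. (12.2) is zero for traceless matrices $\Lambda_i$ and $\Lambda'_i$. In order to compute quantities like

$$ \bigl\langle \tau\bigl(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T \cdots \Lambda_n O \Lambda'_n O^T\bigr) \bigr\rangle_O, \qquad (12.3) $$

one needs to understand how to compute the following moments of rotation matrices,
averaged over the Haar (flat) measure over the orthogonal group O(N):

$$ I(i, j, n) := \bigl\langle O_{i_1 j_1} O_{i_2 j_2} \cdots O_{i_{2n} j_{2n}} \bigr\rangle_O. \qquad (12.4) $$

The general formula has been worked out quite recently and involves the Weingarten
functions. A full discussion of these functions is beyond the scope of this book, but
we want to give here a brief account of the structure of the result. When $N \to \infty$, the
leading term is quite simple: one recovers the Wick contraction rules, as if the $O_{i,j}$ were
independent random Gaussian variables with variance 1/N. Namely,

$$ I(i, j, n) = N^{-n} \sum_{\text{pairings } \pi} \delta_{i_{\pi(1)} i_{\pi(2)}} \delta_{j_{\pi(1)} j_{\pi(2)}} \cdots \delta_{i_{\pi(2n-1)} i_{\pi(2n)}} \delta_{j_{\pi(2n-1)} j_{\pi(2n)}} + O(N^{-n-1}). \qquad (12.5) $$

Note that all pairings of the i-indices are the same as those of the j-indices. For example,
for n = 1 and n = 2 one has explicitly, for $N \to \infty$,

$$ N \bigl\langle O_{i_1 j_1} O_{i_2 j_2} \bigr\rangle_O = \delta_{i_1 i_2} \delta_{j_1 j_2} \qquad (12.6) $$

and

$$ N^2 \bigl\langle O_{i_1 j_1} O_{i_2 j_2} O_{i_3 j_3} O_{i_4 j_4} \bigr\rangle_O = \delta_{i_1 i_2} \delta_{j_1 j_2} \delta_{i_3 i_4} \delta_{j_3 j_4} + \delta_{i_1 i_3} \delta_{j_1 j_3} \delta_{i_2 i_4} \delta_{j_2 j_4} + \delta_{i_1 i_4} \delta_{j_1 j_4} \delta_{i_2 i_3} \delta_{j_2 j_3}. \qquad (12.7) $$

The case n = 1 is exact and has no subleading corrections in N, so we can use it to
compute

$$ \bigl\langle \tau(\Lambda_1 O \Lambda'_1 O^T) \bigr\rangle_O = N^{-1} \sum_{i,j=1}^{N} (\Lambda_1)_{ii}\, (\Lambda'_1)_{jj}\, \bigl\langle O_{ij} O_{ij} \bigr\rangle_O = N^{-2} \sum_{i,j=1}^{N} (\Lambda_1)_{ii}\, (\Lambda'_1)_{jj} = \tau(\Lambda_1)\, \tau(\Lambda'_1). \qquad (12.8) $$

(Recall that $\tau(A)$ is equal to $N^{-1}\mathrm{Tr}\,A$.) Clearly the result is zero when $\tau(\Lambda_1) =
\tau(\Lambda'_1) = 0$, as required by the freeness condition.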
Figure 12.1 Number of loops $\ell(\pi,\pi)$ for $\pi = (1,2)(3,6)(4,5)$. The bottom part of the diagram (thick
gray) corresponds to the first partition $\pi$ and the top part (black) to the second partition, here also
equal to $\pi$. In this example there are three loops each of size 2.


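Now, using only the leading Wick terms for n = 2 and after some index contractions
and manipulations, one would obtain

$$ \lim_{N \to \infty} \bigl\langle \tau(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T) \bigr\rangle_O = \tau(\Lambda_1 \Lambda_2)\,\tau(\Lambda'_1)\,\tau(\Lambda'_2) + \tau(\Lambda_1)\,\tau(\Lambda_2)\,\tau(\Lambda'_1 \Lambda'_2). \qquad (12.9) $$

However, this cannot be correct. Take for example $\Lambda_1 = \Lambda_2 = 1$, for which
$\tau(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T) = \tau(\Lambda'_1 \Lambda'_2)$ exactly, whereas the formula above adds an extra term
$\tau(\Lambda'_1)\tau(\Lambda'_2)$. The correct formula actually reads

$$ \lim_{N \to \infty} \bigl\langle \tau(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T) \bigr\rangle_O = \tau(\Lambda_1 \Lambda_2)\,\tau(\Lambda'_1)\,\tau(\Lambda'_2) + \tau(\Lambda_1)\,\tau(\Lambda_2)\,\tau(\Lambda'_1 \Lambda'_2) - \tau(\Lambda_1)\,\tau(\Lambda_2)\,\tau(\Lambda'_1)\,\tau(\Lambda'_2), \qquad (12.10) $$

which again is zero whenever all individual traces are zero (i.e. the freeness condition).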
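The difference between the Wick-only formula (12.9) and the corrected formula (12.10) is easy to see numerically when the traces are not zero. The sketch below assumes NumPy and, for simplicity, takes $\Lambda_1 = \Lambda_2$ and $\Lambda'_1 = \Lambda'_2$ with illustrative uniform spectra; the sample count and N are arbitrary choices.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def tau(m):
    """Normalized trace (1/N) Tr M."""
    return np.trace(m) / m.shape[0]

rng = np.random.default_rng(2)
N = 300
L = np.diag(rng.uniform(0.0, 2.0, N))    # Lambda_1 = Lambda_2 (not traceless)
Lp = np.diag(rng.uniform(0.0, 2.0, N))   # Lambda'_1 = Lambda'_2 (not traceless)

samples = []
for _ in range(10):
    O = haar_orthogonal(N, rng)
    P = L @ O @ Lp @ O.T
    samples.append(tau(P @ P))           # tau(L O L' O^T L O L' O^T)
estimate = float(np.mean(samples))

# Right-hand sides of Eqs. (12.9) and (12.10) for this choice of matrices.
wick_only = tau(L @ L) * tau(Lp) ** 2 + tau(L) ** 2 * tau(Lp @ Lp)
corrected = wick_only - tau(L) ** 2 * tau(Lp) ** 2
```

The Monte Carlo `estimate` lands close to `corrected` and far from `wick_only`, up to corrections that vanish as N grows.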


12.1.3 Beyond Wick Contractions: Weingarten Functions 


Where does the last term in Eq. (12.10) come from? The solution to this puzzle lies in
the fact that some subleading corrections to Eq. (12.7) also contribute to the trace we are
computing: summing over indices from 1 to N can prop up some subdominant terms and
make them contribute to the final result. Hence we need to know a little more about the
Weingarten functions. This will allow us to conclude that the freeness condition holds for
arbitrary n. The general Weingarten formula reads

$$ I(i, j, n) = \sum_{\text{pairings } \pi,\sigma} W_n(\pi,\sigma)\, \delta_{i_{\pi(1)} i_{\pi(2)}} \delta_{j_{\sigma(1)} j_{\sigma(2)}} \cdots \delta_{i_{\pi(2n-1)} i_{\pi(2n)}} \delta_{j_{\sigma(2n-1)} j_{\sigma(2n)}}, \qquad (12.11) $$

where now the pairings $\pi$ of i's and $\sigma$ of j's do not need to coincide. The Weingarten
functions $W_n(\pi,\sigma)$ can be thought of as matrices with pairings as indices. They are given
by the pseudo-inverse¹ of the matrices $M_n(\pi,\sigma) := N^{\ell(\pi,\sigma)}$, where $\ell(\pi,\sigma)$ is the number
of loops obtained when superposing $\pi$ and $\sigma$. For example, when $\pi = \sigma$ one finds n
loops, each of length 2, see Figure 12.1. When $\pi \neq \sigma$, the number of loops is always
less than n ($\ell(\pi,\sigma) < n$), see Figure 12.2. At large N, the diagonal of the matrix $M_n$
dominates and the matrix is always invertible. By expanding in powers of 1/N, we see
that its inverse $W_n$, whose elements are the Weingarten functions, has an $N^{-n}$ behavior
on the diagonal and off-diagonal terms are at most $N^{-n-1}$. More generally, one has the
following beautiful expansion:

$$ W_n(\pi,\sigma) = N^{\ell(\pi,\sigma) - 2n} \sum_{g=0}^{\infty} \Omega_g(\pi,\sigma)\, N^{-g}, \qquad (12.12) $$

where the coefficients $\Omega_g(\pi,\sigma)$ depend on properties of certain "geodesic paths" in the
space of partitions. The bottom line of this general expansion formula is that non-Wick
contractions are of higher order in $N^{-1}$.

¹ The pseudo-inverse of M is such that WMW = W and MWM = M. When M is invertible, $W = M^{-1}$. If M is
diagonalizable, the eigenvalues of W are the reciprocal of those of M with the rule $1/0 \to 0$.

Figure 12.2 Number of loops $\ell(\pi,\sigma)$ for $\pi = (1,4)(2,3)$ and $\sigma = (1,2)(3,4)$. In this example there
is only one loop.

As an illustration, let us come back to the missing term in Eq. (12.9) when one
restricts to Wick contractions, for which the number of loops $\ell(\pi,\pi) = 2$. The next term
corresponds to $\ell(\pi,\sigma) = 1$ for which $W_2(\pi,\sigma) \sim -N^{-3}$ (see Exercise 12.1.1). Consider
the pairings $i_1 = i_4$, $i_2 = i_3$ and $j_1 = j_2$, $j_3 = j_4$, for which $\ell(\pi,\sigma) = 1$ (see Fig. 12.2).
Such pairings do not add any constraint on the 2n = 4 free indices that appear in
Eq. (12.3); summing over them thus yields $\tau(\Lambda_1)\tau(\Lambda_2)\tau(\Lambda'_1)\tau(\Lambda'_2)$, with a $-1$ coming
from the Weingarten function $W_2(\pi,\sigma)$. Hence we recover the last term of Eq. (12.10).


Exercise 12.1.1 Exact Weingarten functions at n = 2
There are three possible pair partitions of four elements. They are shown
on Figure 3.2. If we number these $\pi_1$, $\pi_2$ and $\pi_3$, $M_2$ is a $3 \times 3$ matrix whose
elements are equal to $N^{\ell(\pi_i,\pi_j)}$.

(a) By trying a few combinations of the three pairings, convince yourself that
for n = 2, $\ell(\pi_i,\pi_j) = 2$ if $i = j$ and 1 otherwise.

(b) Build the matrix $M_2$. For N > 1 it is invertible, find its inverse $W_2$. Hint:
use an ansatz for $W_2$ with a variable a on the diagonal and b off-diagonal.
For N = 1 find the pseudo-inverse of $M_2$.

(c) Finally show that (when N > 1) of the nine pairs of pairings, the three
Wick contractions (diagonal elements) have

$$ W_2^{\text{Wick}} = \frac{N+1}{N^3 + N^2 - 2N} \;\xrightarrow{N \to \infty}\; N^{-2}, \qquad (12.13) $$

and the six non-Wick pairings (off-diagonal) have

$$ W_2^{\text{non-Wick}} = \frac{-1}{N^3 + N^2 - 2N} \;\xrightarrow{N \to \infty}\; -N^{-3}. \qquad (12.14) $$

For N = 1, all nine Weingarten functions are equal to 1/9.

(d) The expression $\langle \tau(O O^T O O^T) \rangle$ is always equal to 1. Write it as a sum over
four indices and expand the expectation value over orthogonal matrices as
nine terms each containing two Dirac deltas multiplied by a Weingarten
function. Each sum of delta terms gives a power of N; find these for all
nine terms and using your result from (c), show that indeed the Weingarten
functions give $\langle \tau(O O^T O O^T) \rangle = 1$ for all N.
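The results quoted in parts (c) and (d) can be cross-checked numerically. The sketch below assumes NumPy and builds $M_2$ from the loop counts of part (a); the function names are illustrative. For part (d), the powers of N follow from writing $\tau(OO^TOO^T) = N^{-1}\sum O_{ab}O_{cb}O_{cd}O_{ad}$, so that the three row pairings contribute $(N, N, N^2)$ and the three column pairings contribute $(N^2, N, N)$.

```python
import numpy as np

def weingarten_n2(N):
    """Weingarten matrix W_2 as the (pseudo-)inverse of M_2 = N**loops."""
    loops = np.ones((3, 3)) + np.eye(3)     # 2 loops on the diagonal, 1 off it
    M2 = np.power(float(N), loops)
    return np.linalg.pinv(M2, rcond=1e-10)  # equals the inverse when N > 1

def tau_OOtOOt(N):
    """<tau(O O^T O O^T)> assembled from the nine Weingarten terms."""
    W = weingarten_n2(N)
    rows = np.array([N, N, N**2], dtype=float)   # index sums for pairings of rows
    cols = np.array([N**2, N, N], dtype=float)   # index sums for pairings of columns
    return rows @ W @ cols / N

for N in (1, 2, 3, 10):
    print(N, np.round(weingarten_n2(N), 6).tolist(), tau_OOtOOt(N))
```

For N > 1 the diagonal and off-diagonal entries reproduce Eqs. (12.13) and (12.14), and the assembled expectation equals 1 in every case, including N = 1 where all entries are 1/9.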




12.1.4 Freeness of Large Matrices 


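We are now ready to show that all expectations of the form (12.3) are zero. Let us look
at them more closely:

$$ \bigl\langle \tau\bigl(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T \cdots \Lambda_n O \Lambda'_n O^T\bigr) \bigr\rangle_O = \frac{1}{N} \sum_{i,j} I(i,j,n)\, [\Lambda_1]_{i_{2n} i_1} [\Lambda_2]_{i_2 i_3} \cdots [\Lambda_n]_{i_{2n-2} i_{2n-1}} [\Lambda'_1]_{j_1 j_2} [\Lambda'_2]_{j_3 j_4} \cdots [\Lambda'_n]_{j_{2n-1} j_{2n}}. \qquad (12.15) $$

The object $I(i,j,n)$ contains all possible pairings of the i indices and all those of the j
indices with a Weingarten function as its prefactor. The i and j indices never mix. We
concentrate on i pairings. Each pairing will give rise to a product of normalized traces of
$\Lambda_i$ matrices. For example, the term

$$ \tau(\Lambda_5)\,\tau(\Lambda_4 \Lambda_1 \Lambda_2)\,\tau(\Lambda_3 \Lambda_6) \qquad (12.16) $$

would appear for n = 6. Each normalized trace introduces a factor of N when going from
Tr(.) to $\tau(.)$. Since by hypothesis $\tau(\Lambda_i) = 0$, the maximum number of non-zero traces is
$\lfloor n/2 \rfloor$, e.g.

$$ \tau(\Lambda_1 \Lambda_3)\,\tau(\Lambda_5 \Lambda_6)\,\tau(\Lambda_2 \Lambda_4). \qquad (12.17) $$

The maximum factor of N that can be generated is therefore $N^{\lfloor n/2 \rfloor}$. Applying the same
reasoning to the j pairing, and using the fact that the Weingarten function is at most
$O(N^{-n})$, we find

$$ \Bigl| \bigl\langle \tau\bigl(\Lambda_1 O \Lambda'_1 O^T \Lambda_2 O \Lambda'_2 O^T \cdots \Lambda_n O \Lambda'_n O^T\bigr) \bigr\rangle_O \Bigr| \le O\bigl(N^{-1 + 2\lfloor n/2 \rfloor - n}\bigr) \;\xrightarrow{N \to \infty}\; 0. \qquad (12.18) $$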
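To make the power counting in Eq. (12.18) concrete, the exponent can be evaluated for the first few values of n (simple arithmetic on the bound, not an additional result):

```latex
-1 + 2\lfloor n/2 \rfloor - n =
\begin{cases}
-1, & n \text{ even} \quad (n = 2:\; -1 + 2 - 2,\;\; n = 4:\; -1 + 4 - 4),\\[2pt]
-2, & n \text{ odd} \quad\; (n = 3:\; -1 + 2 - 3,\;\; n = 5:\; -1 + 4 - 5).
\end{cases}
```

The bound therefore decays at least as fast as $N^{-1}$ for every n.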


Using the same arguments one can show that large unitary invariant complex Hermitian
random matrices are free. In this case one should consider an integral of unitary
matrices in Eq. (12.4). The result is also given by Eq. (12.11) where the functions
$W_n(\pi,\sigma)$ are now unitary Weingarten functions. They are different from the orthogonal
Weingarten functions presented above but they share an important property in the large N
limit, namely

$$ W_n(\pi,\sigma) = \begin{cases} O(N^{-n}) & \text{if } \pi = \sigma, \\ O(N^{-n-k}) & \text{if } \pi \neq \sigma \text{ for some } k \ge 1, \end{cases} \qquad (12.19) $$

which was the property needed in our proof of freeness of large rotationally invariant
symmetric matrices.


12.2 R-Transforms and Resummed Perturbation Theory 


In this section, we want to explore yet another route to obtain the additivity of R- 
transforms, which makes use of perturbation theory and of the mixed cumulant calculus 
introduced in the last chapter, Section 11.3.6, exploited in a concrete case. 

We want to study the average Stieltjes transform of $A + B^R$ where $B^R$ is a randomly
rotated matrix B: $B^R := O B O^T$. We thus write

$$ g(z) := \Bigl\langle \tau\bigl((z1 - A - B^R)^{-1}\bigr) \Bigr\rangle_O := \tau_R\bigl((z1 - A - B^R)^{-1}\bigr), \qquad (12.20) $$

where $\tau_R$ is meant for both the normalized trace $\tau$ and the average over the Haar measure
of the rotation group. We now formally expand the resolvent in powers of $B^R$. Introducing
$G_A = (z1 - A)^{-1}$, one has

$$ g(z) = \tau_R(G_A) + \tau_R(G_A B^R G_A) + \tau_R(G_A B^R G_A B^R G_A) + \cdots. \qquad (12.21) $$

Now, since in the large N limit $G_A$ and $B^R$ are free, we can use the general tracial formula,
Eq. (11.74), noting that all mixed cumulants (containing both $G_A$ and $B^R$) are identically
zero.

In order to proceed, one needs to introduce three types of mixed moments where $B^R$
appears exactly n times:

$$ m_n^{(1)} := \tau_R(G_A B^R G_A \cdots G_A B^R G_A), \qquad m_n^{(2)} := \tau_R(B^R G_A \cdots G_A B^R) \qquad (12.22) $$

and

$$ m_n^{(3)} := \tau_R(B^R G_A \cdots B^R G_A) = \tau_R(G_A B^R \cdots G_A B^R). \qquad (12.23) $$

Note that $m_0^{(1)} = g_A(z)$ and $m_0^{(2)} = m_0^{(3)} = 0$. We also introduce, for full generality, the
corresponding generating functions:

$$ M^{(a)}(u) = \sum_{n=0}^{\infty} m_n^{(a)} u^n, \qquad a = 1, 2, 3. \qquad (12.24) $$

Note however that we will only be interested here in $g(z) := M^{(1)}(u = 1)$ (cf. Eq.
(12.21)).

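Let us compute $m_n^{(1)}$ using the same method as in Section 11.3.5, i.e. expanding in the
size $\ell$ of the group to which the first $G_A$ belongs (see Eq. (11.71)):

$$ m_n^{(1)} = \sum_{\ell=1}^{n+1} \kappa_{G_A,\ell} \sum_{\substack{k_1,k_2,\dots,k_\ell = 0 \\ k_1 + k_2 + \cdots + k_\ell = n - \ell}} m_{k_1}^{(2)} m_{k_2}^{(2)} \cdots m_{k_{\ell-1}}^{(2)}\, m_{k_\ell}^{(3)}, \qquad (12.25) $$

where $n \ge 1$ and $\kappa_{G_A,\ell}$ are the free cumulants of $G_A$. Similarly,

$$ m_n^{(2)} = \sum_{\ell=1}^{n} \kappa_{B,\ell} \sum_{\substack{k_1,k_2,\dots,k_\ell = 0 \\ k_1 + k_2 + \cdots + k_\ell = n - \ell}} m_{k_1}^{(1)} m_{k_2}^{(1)} \cdots m_{k_{\ell-1}}^{(1)}\, m_{k_\ell}^{(3)} \qquad (12.26) $$

and

$$ m_n^{(3)} = \sum_{\ell=1}^{n} \kappa_{B,\ell} \sum_{\substack{k_1,k_2,\dots,k_\ell = 0 \\ k_1 + k_2 + \cdots + k_\ell = n - \ell}} m_{k_1}^{(1)} m_{k_2}^{(1)} \cdots m_{k_\ell}^{(1)}, \qquad (12.27) $$

where $\kappa_{B,\ell}$ are the free cumulants of B. Multiplying both sides of these equations by $u^n$
and summing over n leads to, respectively,

$$ M^{(1)}(u) = g_A(z) + \sum_{\ell=1}^{\infty} \kappa_{G_A,\ell}\, u^{\ell}\, [M^{(2)}(u)]^{\ell-1} M^{(3)}(u), \qquad (12.28) $$

and

$$ M^{(2)}(u) = \sum_{\ell=1}^{\infty} \kappa_{B,\ell}\, u^{\ell}\, [M^{(1)}(u)]^{\ell-1} M^{(3)}(u), \qquad M^{(3)}(u) = \sum_{\ell=1}^{\infty} \kappa_{B,\ell}\, u^{\ell}\, [M^{(1)}(u)]^{\ell}. \qquad (12.29) $$

Recalling the definition of R-transforms as a power series of cumulants, we thus get

$$ M^{(1)}(u) = g_A + u M^{(3)}(u)\, R_{G_A}\bigl(u M^{(2)}(u)\bigr) \qquad (12.30) $$

and

$$ M^{(2)}(u) = u M^{(3)}(u)\, R_B\bigl(u M^{(1)}(u)\bigr), \qquad M^{(3)}(u) = u M^{(1)}(u)\, R_B\bigl(u M^{(1)}(u)\bigr). \qquad (12.31) $$

Eliminating $M^{(2)}(u)$ and $M^{(3)}(u)$ and setting u = 1 then yields the following relation:

$$ g(z) = g_A(z) + g(z)\, R_B(g(z))\, R_{G_A}\bigl(g(z) R_B^2(g(z))\bigr). \qquad (12.32) $$

In order to rewrite this result in more familiar terms, let us consider the case where B = b1,
in which case $R_B(z) = b$ and, since A + B = A + b1, $g(z) = g_A(z - b)$. Hence, for arbitrary
b, $R_{G_A}$ must obey the relation

$$ R_{G_A}\bigl(b^2 g_A(z - b)\bigr) = \frac{g_A(z - b) - g_A(z)}{b\, g_A(z - b)}. \qquad (12.33) $$

Now, if for a fixed z we set $b = R_B(g(z))$, we find that Eq. (12.32) is obeyed provided
$g(z) = g_A(z - R_B(g(z)))$, i.e. precisely the subordination relation Eq. (11.80).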
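The final subordination relation $g(z) = g_A(z - R_B(g(z)))$ can be checked numerically without going through the intermediate steps. The sketch below assumes NumPy and makes illustrative choices that are not in the text: A is diagonal with eigenvalues $\pm 1$, and B is a Wigner matrix of variance $\sigma^2$, for which $R_B(g) = \sigma^2 g$ and which is rotationally invariant, so no explicit rotation is needed.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 600
sigma = 0.8

# A: deterministic diagonal matrix with eigenvalues +1 and -1 (illustrative).
a = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)
A = np.diag(a)

# B: Wigner matrix with off-diagonal variance sigma^2 / N, so R_B(g) = sigma^2 g.
H = rng.standard_normal((N, N))
B = sigma * (H + H.T) / np.sqrt(2 * N)

z = 0.3 + 1.0j
eigs = np.linalg.eigvalsh(A + B)
g = np.mean(1.0 / (z - eigs))            # empirical Stieltjes transform of A + B

def g_A(w):
    """Stieltjes transform of A evaluated at the complex point w."""
    return np.mean(1.0 / (w - a))

residual = abs(g - g_A(z - sigma**2 * g))   # should vanish as N grows
naive_gap = abs(g - g_A(z))                 # what one gets when ignoring B
```

The `residual` is small compared with `naive_gap`, which is the expected behavior if the subordination relation holds up to finite-N corrections.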
In the last chapter, we briefly discussed the extension of the CLT for non-commuting variables.
We now restate the result in the context of random matrices, with a special focus on
the preasymptotic (cumulant) corrections to the Wigner distribution.

Let us consider the following sum of K large, randomly rotated matrices, all assumed to
be traceless:

$$ M_K := \frac{1}{\sqrt{K}} \sum_{i=1}^{K} O_i A_i O_i^T, \qquad \mathrm{Tr}\, A_i = 0, \qquad (12.34) $$

where $O_i$ are independent, random rotation matrices, chosen with a flat measure over
the orthogonal group O(N). We also assume, for simplicity, that all $A_i$ have the same
(arbitrary) moments:

$$ \tau(A_i^{\ell}) = m_\ell, \qquad \forall i. \qquad (12.35) $$

This means in particular that all $A_i$'s share the same R-transform:

$$ R_{A_i}(z) = \sum_{\ell=2}^{\infty} \kappa_\ell\, z^{\ell-1}. \qquad (12.36) $$

Using the fact that R-transforms of randomly rotated matrices simply add, together with
$R_{\alpha M}(z) = \alpha R_M(\alpha z)$, one finds

$$ R_{M_K}(z) = \sum_{\ell=2}^{\infty} K^{1-\ell/2}\, \kappa_\ell\, z^{\ell-1}. \qquad (12.37) $$

We thus see that, as K becomes large, all free cumulants except the second one tend to
zero, which implies that the limit of $M_K$ when K goes to infinity is a Wigner matrix, with
a semi-circle eigenvalue spectrum.
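A small experiment illustrates Eq. (12.37). The sketch below assumes NumPy and takes every $A_i$ with eigenvalues $\pm 1$ in equal proportion (an illustrative choice), so that $\kappa_2 = 1$ and $\kappa_4 = m_4 - 2m_2^2 = -1$. The fourth moment of $M_K$ should then be close to $2\kappa_2^2 + \kappa_4/K = 2 - 1/K$, approaching the semi-circle value 2 as K grows.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(4)
N, K = 300, 8
a = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)   # traceless spectrum of each A_i

M = np.zeros((N, N))
for _ in range(K):
    O = haar_orthogonal(N, rng)
    M += (O * a) @ O.T            # O diag(a) O^T
M /= np.sqrt(K)

eigs = np.linalg.eigvalsh(M)
m2 = np.mean(eigs**2)             # expected close to kappa_2 = 1
m4 = np.mean(eigs**4)             # expected close to 2 - 1/K
```

Increasing K at fixed N moves `m4` toward 2, the fourth moment of the unit semi-circle.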

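It is interesting to study the finite K corrections to this result. First, assume that the
spectrum of $A_i$ is not symmetric around zero, such that the skewness $m_3 = \kappa_3 \neq 0$. For
large K, the R-transform of $M_K$ can be approximated as

$$ R_{M_K}(z) \approx \sigma^2 z + \frac{\kappa_3}{\sqrt{K}} z^2 + \cdots. \qquad (12.38) $$

In order to derive the corrections to the semi-circle induced by skewness, we posit that the
Stieltjes transform can be expanded around the Wigner result $g_X(z)$ as

$$ g_{M_K}(z) = g_X(z) + \frac{\kappa_3}{\sqrt{K}}\, g_3(z) + \cdots \qquad (12.39) $$

and assume the second term to be very small. The R-transform, $R_{M_K}(z)$, can be similarly
expanded, yielding, with $R_3(g) := g^2$,

$$ R_X(g_X(z)) + \frac{\kappa_3}{\sqrt{K}} R_3(g_X(z)) = z - \frac{1}{g_X(z)} - \frac{\kappa_3}{\sqrt{K}} \frac{g_3(z)}{g'_X(z)}, \qquad (12.40) $$

or, equivalently,

$$ g_3(z) = -g'_X(z)\, R_3(g_X(z)) = -g'_X(z)\, g_X^2(z). \qquad (12.41) $$

For simplicity, we normalize the Wigner semi-circle such that $\sigma^2 = 1$, and hence

$$ g_X(z) = \frac{1}{2}\left(z - \sqrt{\Delta}\right), \qquad g'_X(z) = -\frac{g_X(z)}{\sqrt{\Delta}}; \qquad \Delta := z^2 - 4. \qquad (12.42) $$

One then finds

$$ g_3(z) = \frac{g_X^3(z)}{\sqrt{\Delta}} = \frac{1}{\sqrt{\Delta}} \left(\frac{z - \sqrt{\Delta}}{2}\right)^{3}. \qquad (12.43) $$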

The imaginary part of this expression when $z \to \lambda + i0$ gives the correction to the
semi-circle eigenvalue spectrum, and reads, for $\lambda \in [-2, 2]$,

$$ \delta\rho(\lambda) = \frac{\kappa_3}{2\pi\sqrt{K}}\, \frac{\lambda(\lambda^2 - 3)}{\sqrt{4 - \lambda^2}}. \qquad (12.44) $$

This correction is plotted in Figure 12.3. Note that it is odd in $\lambda$, as expected, and diverges
near the edges of the spectrum, around which the above perturbation approach breaks down
(because the density of eigenvalues of the Wigner matrix goes to zero). Note that the excess
skewness is computed to be

$$ \int_{-2}^{2} \lambda^3\, \delta\rho(\lambda)\, d\lambda = \frac{\kappa_3}{\sqrt{K}}, \qquad (12.45) $$

as expected.

Figure 12.3 First order correction to the Wigner density of eigenvalues for non-zero skewness
($\delta\rho_3(\lambda)$) and kurtosis ($\delta\rho_4(\lambda)$), Eqs. (12.44) and (12.47), respectively.

One can obtain the corresponding correction when $A_i$ is symmetric around zero, in
which case the first correction term comes from the kurtosis. For large K, the R-transform
of $M_K$ can now be approximated as

$$ R_{M_K}(z) \approx \sigma^2 z + \frac{\kappa_4}{K} z^3 + \cdots. \qquad (12.46) $$

Following the same path as above, one finally derives the correction to the eigenvalue
spectrum, which in this case reads, for $\lambda \in [-2, 2]$,

$$ \delta\rho(\lambda) = \frac{\kappa_4}{2\pi K}\, \frac{\lambda^4 - 4\lambda^2 + 2}{\sqrt{4 - \lambda^2}} \qquad (12.47) $$

(see Fig. 12.3). The correction is now even in $\lambda$ and one can check that

$$ \int_{-2}^{2} \delta\rho(\lambda)\, d\lambda = 0, \qquad (12.48) $$

as it should be, since all the mass is carried by the semi-circle. Another check that this
result is correct is to compute the excess free kurtosis, given by

$$ \int_{-2}^{2} \lambda^4\, \delta\rho(\lambda)\, d\lambda = \frac{\kappa_4}{K}, \qquad (12.49) $$

again as expected.
These corrections to the free CLT are the analog of the Edgeworth series for the classical
CLT. In full generality, the contribution of the nth cumulant to $\delta\rho(\lambda)$ reads

$$ \delta\rho_n(\lambda) = \frac{\kappa_n}{\pi K^{n/2-1}}\, \frac{T_n\!\left(\frac{\lambda}{2}\right)}{\sqrt{4 - \lambda^2}}, \qquad (12.50) $$

where $T_n(x)$ are the Chebyshev polynomials of the first kind.

12.4 Finite Free Convolutions 


We saw that rotationally invariant random matrices become asymptotically free as their 
size goes to infinity. At finite N freeness is not exact, except in very special cases (see 
Section 12.5). In this section we will discuss operations on polynomials that are analogous 
to the free addition and multiplication but unfortunately lack the full set of properties 
of freeness as defined in Chapter 11. When the polynomials are thought of as expected 
characteristic polynomials of large matrices, finite free addition and multiplication do 
indeed converge to the free addition and multiplication when the size of the matrix (degree 
of the polynomial) goes to infinity. 


12.4.1 Notations: Roots and Coefficients 


Let p(z) be a polynomial of degree N. By the fundamental theorem of algebra, this
polynomial will have exactly N roots (when counted with their multiplicity) and can be
written as

$$ p(z) = a_0 \prod_{i=1}^{N} (z - \lambda_i). \qquad (12.51) $$

If $a_0 = 1$ we say that p(z) is monic and if all the $\lambda_i$'s are real the polynomial is called
real-rooted. In this section we will only consider real-rooted monic polynomials. Such a
polynomial can always be viewed as the characteristic polynomial of the diagonal matrix
$\Lambda$ containing its roots, i.e.

$$ p(z) = \det(z1 - \Lambda). \qquad (12.52) $$

We can expand the product (12.51) as

$$ p(z) = \sum_{k=0}^{N} (-1)^k a_k z^{N-k}. \qquad (12.53) $$

Note that we have defined the coefficient $a_k$ as the coefficient of $z^{N-k}$ and not that of $z^k$
and that we have included alternating signs $(-1)^k$ in its definition. The reason is that we
want a simple link between the coefficients $a_k$ and the roots $\lambda_i$. We have

$$ a_k = \sum_{\text{ordered } k\text{-tuples } i} \lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_k}, \qquad (12.54) $$

$$ a_0 = 1, \qquad a_1 = \sum_{i=1}^{N} \lambda_i, \qquad a_2 = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \lambda_i \lambda_j, \qquad \dots, \qquad a_N = \prod_{i=1}^{N} \lambda_i. \qquad (12.55) $$
Note that the coefficient $a_k$ is homogeneous to $\lambda^k$. From the coefficients $a_k$ we can
compute the sample moments² of the set of $\lambda_i$'s. In particular we have

$$ \mu(\{\lambda_i\}) = \frac{a_1}{N}, \qquad \sigma^2(\{\lambda_i\}) = \frac{a_1^2}{N} - \frac{2 a_2}{N-1}. \qquad (12.56) $$

The polynomial p(z) will often be the expected characteristic polynomial of some
random matrix M of size N with joint distribution of eigenvalues $P(\{\mu_i\})$. The coefficients
$a_k$ are then multi-linear moments of this joint distribution. If the random
eigenvalues $\mu_i$ do not fluctuate, we have $\lambda_i = \mu_i$ but in general we should think
of the $\lambda_i$'s as deterministic numbers fixed by the random ensemble considered. The
case of independent eigenvalues gives a trivial expected characteristic polynomial,
i.e. $p(z) = (z - E(\mu))^N$ or else $\lambda_i = E(\mu)\ \forall i$.

Shifts and scaling of the matrix M can be mapped onto operations on the polynomial
$p_M(z)$. For a shift, we have

$$ p_{M + \alpha 1}(z) = p_M(z - \alpha). \qquad (12.57) $$


Multiplication by a scalar gives

$$ p_{\alpha M}(z) = \alpha^N p_M(\alpha^{-1} z) \;\Rightarrow\; a_k \to \alpha^k a_k. \qquad (12.58) $$

Finally there is a formula for matrix inversion which is only valid when the eigenvalues
are deterministic (e.g. $M = O \Lambda O^T$ with fixed $\Lambda$):

$$ p_{M^{-1}}(z) = \frac{(-z)^N}{a_N}\, p_M\!\left(\frac{1}{z}\right) \;\Leftrightarrow\; a_k \to \frac{a_{N-k}}{a_N}. \qquad (12.59) $$

A degree N polynomial can always be written as a degree N polynomial of the derivative
operator acting on the monomial $z^N$. We introduce the notation $\hat{p}$ as

$$ p(z) =: \hat{p}(d_z)\, z^N, \qquad \text{the coefficients of } \hat{p} \text{ are } \hat{a}_k = (-1)^k \frac{(N-k)!}{N!}\, a_k, \qquad (12.60) $$

where $d_z$ is a shorthand notation for d/dz. It will prove useful to compute finite free
convolutions.


12.4.2 Finite Free Addition 


The equivalent of the free addition for two monic polynomials $p_1(z)$ and $p_2(z)$ of the
same degree N is the finite free additive convolution defined as

$$ p_1 \boxplus p_2(z) = \bigl\langle \det[z1 - \Lambda_1 - O \Lambda_2 O^T] \bigr\rangle_O, \qquad (12.61) $$

where the diagonal matrices $\Lambda_{1,2}$ contain the roots of $p_{1,2}(z)$, all assumed to be real. The
averaging over the orthogonal matrix O is as usual done over the flat (Haar) measure. We
could have chosen to integrate O over unitary matrices or even permutation matrices; the
final result would be the same.

As we will show in Section 12.4.5, the additive convolution can be expressed in a very
concise form using the polynomials $\hat{p}(x)$:

$$ p_1 \boxplus p_2(z) = \hat{p}_1(d_z)\, p_2(z) = \hat{p}_2(d_z)\, p_1(z) = \hat{p}_1(d_z)\, \hat{p}_2(d_z)\, z^N. \qquad (12.62) $$

² Here we have chosen to normalize the sample variance with a factor $(N-1)^{-1}$. It may seem odd to use the formula suited to
when the mean is unknown for computing the variance of deterministic numbers, but this definition will later match that of the
finite free cumulant and give the Hermite polynomial the variance of the corresponding Wigner matrix.
It is easy to see that when $p_s(z) = p_1 \boxplus p_2(z)$, $p_s(z)$ is again a monic polynomial of
degree N. What is less obvious, but true, is that $p_s(z)$ is also real-rooted. A proof of this
is beyond the scope of this book. The additive convolution is bilinear in the coefficients of
$p_1(z)$ and $p_2(z)$, which means that the operation commutes with the expectation value. If
$p_{1,2}(z)$ are independent random polynomials (for example, characteristic polynomials of
independent random matrices) we have a relation for their expected value:

$$ E[p_1 \boxplus p_2(z)] = E[p_1(z)] \boxplus E[p_2(z)]. \qquad (12.63) $$

The finite free addition can also be written as a relation between the coefficients $a_k^{(s)}$
of $p_s(z)$ and those of $p_{1,2}(z)$:

$$ a_k^{(s)} = \sum_{i+j=k} \frac{(N-i)!\,(N-j)!}{N!\,(N-k)!}\, a_i^{(1)} a_j^{(2)}. \qquad (12.64) $$

More explicitly, the first three coefficients are given by

$$ a_0^{(s)} = 1, \qquad a_1^{(s)} = a_1^{(1)} + a_1^{(2)}, \qquad a_2^{(s)} = a_2^{(1)} + a_2^{(2)} + \frac{N-1}{N}\, a_1^{(1)} a_1^{(2)}, \qquad (12.65) $$

from which we can verify that both the sample mean and the variance (Eq. (12.56)) are
additive under the finite free addition.
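The coefficient formula (12.64) is easy to test against a brute-force average over permutation matrices for small N. The sketch below assumes NumPy; the function names and the two root sets are illustrative choices, and the mean and variance use Eq. (12.56).

```python
import numpy as np
from itertools import permutations
from math import factorial

def coeffs_from_roots(roots):
    """Coefficients a_k of Eq. (12.53): p(z) = sum_k (-1)^k a_k z^(N-k)."""
    c = np.poly(roots)                       # monic, highest power first
    return np.array([(-1) ** k * c[k] for k in range(len(c))])

def finite_free_add(a1, a2):
    """Coefficients of the finite free additive convolution, Eq. (12.64)."""
    N = len(a1) - 1
    out = np.zeros(N + 1)
    for k in range(N + 1):
        for i in range(k + 1):
            j = k - i
            w = factorial(N - i) * factorial(N - j) / (factorial(N) * factorial(N - k))
            out[k] += w * a1[i] * a2[j]
    return out

def permutation_average(r1, r2):
    """Brute-force average of the coefficients over all N! permutations."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    perms = list(permutations(range(len(r1))))
    return sum(coeffs_from_roots(r1 + r2[list(p)]) for p in perms) / len(perms)

def mean_and_variance(a):
    """Sample mean and variance of the roots, Eq. (12.56)."""
    N = len(a) - 1
    return a[1] / N, a[1] ** 2 / N - 2 * a[2] / (N - 1)

r1 = [-1.0, 0.0, 0.5, 2.0]
r2 = [-2.0, 1.0, 1.0, 3.0]
a1, a2 = coeffs_from_roots(r1), coeffs_from_roots(r2)
a_sum = finite_free_add(a1, a2)
```

For these inputs `a_sum` coincides with `permutation_average(r1, r2)`, and the mean and variance returned by `mean_and_variance(a_sum)` are the sums of those of the two inputs.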

If we call $p_\mu(z) = (z - \mu)^N$ the polynomial with a single root $\mu$, so that $p_0(z) = z^N$ is the
trivial monic polynomial, we have that, under additive convolution with any p(z), $p_0(z)$
acts as a null element and $p_\mu(z)$ acts as a shift:

$$ p \boxplus p_0(z) = p(z) \quad \text{and} \quad p \boxplus p_\mu(z) = p(z - \mu). \qquad (12.66) $$

Hermite polynomials are stable under this addition:

$$ H_N \boxplus H_N(z) = 2^{N/2} H_N(2^{-1/2} z), \qquad (12.67) $$

where the factors $2^{1/2}$ compensate the doubling of the sample variance.

The average characteristic polynomials of Wishart matrices can easily be understood
in terms of the finite free sum. Consider an N-dimensional rank-1 matrix $M = x x^T$ where x
is a vector of IID unit variance numbers. It has one eigenvalue equal to N (on average) and
all others are zero, so its average characteristic polynomial is

$$ p(z) = z^{N-1}(z - N) = (1 - d_z)\, z^N, \qquad (12.68) $$

from which we read that $\hat{p}(d_z) = (1 - d_z)$. But since an unnormalized Wishart matrix of
parameter T is just the free sum of T such projectors, Eq. (12.62) immediately leads to
$p_T(z) = (1 - d_z)^T z^N$, which coincides with Eq. (6.41).
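The formula $p_T(z) = (1 - d_z)^T z^N$ can be checked by hand in the smallest non-trivial case N = T = 2. This worked example assumes that the entries of the $2 \times 2$ matrix H (columns are the two vectors x) are IID with zero mean and unit variance, so that $E[\mathrm{Tr}\, HH^T] = 4$ and $E[\det HH^T] = E[(h_{11}h_{22} - h_{12}h_{21})^2] = 2$:

```latex
(1 - d_z)^2 z^2 = \left(1 - 2 d_z + d_z^2\right) z^2 = z^2 - 4z + 2
\;=\; z^2 - E[\mathrm{Tr}\, HH^T]\, z + E[\det HH^T] .
```

Both sides agree, as expected for the average characteristic polynomial of this small Wishart matrix.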


12.4.3 A Finite R-Transform 


If we look back at Eq. (12.62), we notice that the polynomial $\hat{p}(d_z)$ behaves like the
Fourier transform under free addition. Its logarithm is therefore additive. We need to be
a bit careful of what we mean by the logarithm of a polynomial. The function $\hat{p}(d_z)$ has
two important properties: first, as a power series in $d_z$ it always starts as $1 + O(d_z)$; second, it
is defined by its action on Nth order polynomials in z, so only the first N + 1 terms of its
Taylor series matter. We will say that the polynomial is defined modulo $d_z^{N+1}$, meaning
that higher order terms are set to zero. When we take the logarithm of $\hat{p}(u)$ we thus mean:
apply the Taylor series of the logarithm around 1 and expand up to the power $u^N$. The
resulting function

$$ L(u) := -\left[\log \hat{p}(u)\right] \bmod u^{N+1} \qquad (12.69) $$

is then additive. For average characteristic polynomials of random matrices, it should be
related to the R-transform in the large N limit.
Let us examine more closely L(u) in three simple cases: identity, Wigner and Wishart.

• For the identity matrix of size N we have

$$ p_1(z) = (z - 1)^N = \sum_{k=0}^{N} (-1)^k \binom{N}{k} z^{N-k} = \sum_{k=0}^{N} \frac{(-1)^k}{k!} \left(\frac{d}{dz}\right)^{k} z^N = \exp(-d_z)\, z^N. \qquad (12.70) $$

So we find

$$ \hat{p}_1(u) = \left[\exp(-u)\right] \bmod u^{N+1} \;\Rightarrow\; L_1(u) = u. \qquad (12.71) $$

• For Wigner matrices, we saw in Chapter 6 that the expected characteristic polynomial
of a unit Wigner matrix is given by a Hermite polynomial normalized as

$$ p_X(z) = N^{-N/2} H_N(\sqrt{N} z) = \exp\left[-\frac{1}{2N} \left(\frac{d}{dz}\right)^{2}\right] z^N, \qquad (12.72) $$

where the right hand side comes from Eq. (6.8). We then have

$$ \hat{p}_X(u) = \left[\exp\left(-\frac{u^2}{2N}\right)\right] \bmod u^{N+1} \;\Rightarrow\; L_X(u) = \frac{u^2}{2N}. \qquad (12.73) $$

• For Wishart matrices, Eq. (6.41) expresses the expected characteristic polynomial as
a derivative operator acting on the monomial $z^N$. For a correctly normalized Wishart
matrix, we then find the monic Laguerre polynomial:

$$ p_W(z) = \left(1 - \frac{1}{T} \frac{d}{dz}\right)^{T} z^N, \qquad (12.74) $$

from which we can immediately read off the polynomial $\hat{p}(u)$:

$$ \hat{p}_W(u) = \left[\left(1 - \frac{u}{T}\right)^{T}\right] \bmod u^{N+1} \;\Rightarrow\; L_W(u) = -T \left[\log\left(1 - \frac{u}{T}\right)\right] \bmod u^{N+1}. \qquad (12.75) $$

We notice that in these three cases, the L function is related to the corresponding limiting
R-transform of the infinite size matrices as

$$ L'(u) = \left[R(u/N)\right] \bmod u^{N+1}. \qquad (12.76) $$

The equality at finite N holds for these simple cases but not in the general case. But the
equality holds in general in the limiting case:

$$ \lim_{N \to \infty} L'(Nx) = R(x). \qquad (12.77) $$
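The truncated logarithm in Eq. (12.69) can be computed with a short recurrence. The sketch below assumes NumPy and represents $\hat{p}(u)$ by its list of coefficients up to order N; the function names and the values N = 4, T = 6 are illustrative. It recovers $L_X$ and $L_W$ from Eqs. (12.73) and (12.75) and checks that L is additive when two $\hat{p}$'s are multiplied modulo $u^{N+1}$, as in Eq. (12.62).

```python
import numpy as np
from math import comb

def finite_L(p_hat):
    """Coefficients l_0..l_N of L(u) = -[log p_hat(u)] mod u^(N+1).

    p_hat[k] is the coefficient of u^k and p_hat[0] must equal 1.
    """
    N = len(p_hat) - 1
    log_p = np.zeros(N + 1)
    for k in range(1, N + 1):
        # From P' = P * (log P)':  k*l_k = k*p_k - sum_{j=1}^{k-1} j*l_j*p_{k-j}
        s = sum(j * log_p[j] * p_hat[k - j] for j in range(1, k))
        log_p[k] = p_hat[k] - s / k
    return -log_p

def truncated_product(p, q):
    """Product of two truncated series, again modulo u^(N+1)."""
    return np.convolve(p, q)[: len(p)]

N, T = 4, 6
# Wishart-type operator (1 - u/T)^T truncated at order N, cf. Eq. (12.75).
p_wishart = np.array([comb(T, k) * (-1.0 / T) ** k for k in range(N + 1)])
# Wigner-type operator exp(-u^2 / (2N)) truncated at order N, cf. Eq. (12.73).
p_wigner = np.array([1.0, 0.0, -1.0 / (2 * N), 0.0, 1.0 / (8 * N**2)])

L_wishart = finite_L(p_wishart)
L_wigner = finite_L(p_wigner)
L_sum = finite_L(truncated_product(p_wishart, p_wigner))
```

Here `L_wigner` has the single non-zero coefficient 1/(2N) at order $u^2$, `L_wishart` has coefficients $1/(k T^{k-1})$ at order $u^k$, and `L_sum` equals their sum.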




12.4.4 Finite Free Product 


The free product can also be generalized to an operation on real-rooted monic polynomials
of the same degree. We define

$$ p_1 \boxtimes p_2(z) = \bigl\langle \det[z1 - \Lambda_1 O \Lambda_2 O^T] \bigr\rangle_O, \qquad (12.78) $$

where as usual the diagonal matrices $\Lambda_{1,2}$ contain the roots of $p_{1,2}(z)$. As in the additive
case, averaging the matrix O over the permutation, orthogonal or unitary group gives the
same result.

We will show in Section 12.4.5 that the result of the finite free product has a simple
expression in terms of the coefficients $a_k$ defined by (12.53). When $p_m(z) = p_1 \boxtimes p_2(z)$
we have

$$ a_k^{(m)} = \binom{N}{k}^{-1} a_k^{(1)} a_k^{(2)}. \qquad (12.79) $$

Note that if $\Lambda_1 = \alpha 1$ is a multiple of the identity with $\alpha \neq 0$ we have

$$ p_{\alpha 1}(z) = (z - \alpha)^N \;\Rightarrow\; a_k^{(\alpha 1)} = \binom{N}{k} \alpha^k. \qquad (12.80) $$

Plugging this into Eq. (12.79), we see that the free product with a multiple of the identity
multiplies each $a_k$ by $\alpha^k$, which is equivalent to multiplying each root by $\alpha$. In particular
the identity matrix ($\alpha = 1$) is the neutral element for that convolution.

When $p_m(z) = p_1 \boxtimes p_2(z)$, the sample mean of the roots of $p_m(z)$ (Eq. (12.56))
behaves as

$$ \mu_{(m)} = \mu_{(1)}\, \mu_{(2)}. $$

When both means are unity, we have for the sample variance

$$ \sigma^2_{(m)} = \sigma^2_{(1)} + \sigma^2_{(2)} - \frac{\sigma^2_{(1)} \sigma^2_{(2)}}{N}. \qquad (12.81) $$

At large N the last term becomes negligible and we recover the additivity of the variance
for the product of unit-trace free variables (see Eq. (11.85)).
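As for the additive case, Eq. (12.79) can be compared with a brute-force average over permutation matrices, for which the eigenvalues of $\Lambda_1 O \Lambda_2 O^T$ are simply products of paired roots. The sketch below assumes NumPy; the function names and the two unit-mean root sets are illustrative choices.

```python
import numpy as np
from itertools import permutations
from math import comb

def coeffs_from_roots(roots):
    """Coefficients a_k of Eq. (12.53): p(z) = sum_k (-1)^k a_k z^(N-k)."""
    c = np.poly(roots)
    return np.array([(-1) ** k * c[k] for k in range(len(c))])

def finite_free_mul(a1, a2):
    """Coefficients of the finite free product, Eq. (12.79)."""
    N = len(a1) - 1
    return np.array([a1[k] * a2[k] / comb(N, k) for k in range(N + 1)])

def permutation_average(r1, r2):
    """Average over all permutations of the polynomial with roots r1_i * r2_sigma(i)."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    perms = list(permutations(range(len(r1))))
    return sum(coeffs_from_roots(r1 * r2[list(p)]) for p in perms) / len(perms)

def mean_and_variance(a):
    """Sample mean and variance of the roots, Eq. (12.56)."""
    N = len(a) - 1
    return a[1] / N, a[1] ** 2 / N - 2 * a[2] / (N - 1)

r1 = [0.5, 1.0, 1.5, 1.0]      # unit sample mean
r2 = [0.2, 0.8, 1.0, 2.0]      # unit sample mean
a1, a2 = coeffs_from_roots(r1), coeffs_from_roots(r2)
a_prod = finite_free_mul(a1, a2)
```

The product coefficients match the permutation average, the sample mean stays equal to 1, and the sample variance follows Eq. (12.81) including its 1/N term.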


Exercise 12.4.1 Free product of polynomials with roots 0 and 1
Consider an even degree N = 2M polynomial with roots 0 and 1 both with
multiplicity M: $p(z) = z^M (z - 1)^M$. We will study the finite free product of this
polynomial with itself, $p_m(z) = p \boxtimes p(z)$. We will study the large N limit of this
problem in Section 15.4.2.

(a) Expand the polynomial and write the coefficients $a_k$ in terms of binomial
coefficients.
(b) Using Eq. (12.79) show that the coefficients of $p_m(z)$ are given by

$$ a_k^{(m)} = \begin{cases} \dbinom{M}{k}^{2} \Big/ \dbinom{2M}{k} & \text{when } k \le M, \\ 0 & \text{otherwise.} \end{cases} \qquad (12.82) $$

(c) The polynomial $p_m(z)$ always has zero as a root. What is its multiplicity?
(d) The degree M polynomial $q(z) = z^{-M} p_m(z)$ only has non-zero roots. What
is its average root?

(e) For M = 2, show that the two roots of q(z) are $1/2 \pm 1/\sqrt{12}$.

(f) The case M = 4 can be solved by noticing that q(z) is a symmetric function
of (z − 1/2). Show that the four roots are given by

$$ \lambda_{\pm\pm} = \frac{1}{2} \pm \frac{1}{2} \sqrt{\frac{15 \pm 2\sqrt{30}}{35}}. \qquad (12.83) $$

12.4.5 Finite Free Convolutions: Derivation of the Results 


We will first study the definition (12.61) when the matrix O is averaged over the 
permutation group Sy. In this case we can write 


N 
1 1 2 
ps2) = PiBpo@)= >, D Me- -so (12.84) 
permutations i=1 
o 


The coefficients a® 


are given by the average over the permutation group of the coeffi- 


cients of the polynomial with roots a + DSA }. For example for a we have 
(s)_ | (0) 42) 
S 
a =N D DD Q i io) 
permutations i 
o 
a 1 2 1 2 
=P (ay? +4 ) = af) + ay. (12.85) 

i 


For the other coefficients $a_k$, the combinatorics is a bit more involved. Let us first figure
out the structure of the result. For each permutation we can expand the product

$$\left(z - \lambda_1^{(1)} - \lambda_{\sigma(1)}^{(2)}\right)\left(z - \lambda_2^{(1)} - \lambda_{\sigma(2)}^{(2)}\right) \cdots \left(z - \lambda_N^{(1)} - \lambda_{\sigma(N)}^{(2)}\right). \tag{12.86}$$

To get a contribution to $a_k$ we need to choose the variable z (N - k) times, we can then
choose a $\lambda^{(1)}$ i times and a $\lambda^{(2)}$ (k - i) times. For a given k and for each choice of i, once
averaged over all permutations, the product of the $\lambda^{(1)}$ and that of the $\lambda^{(2)}$ must both be
completely symmetric and therefore proportional to $a_i^{(1)} a_{k-i}^{(2)}$. We thus have

$$a_k^{(s)} = \sum_{i=0}^{k} C(i,k,N)\, a_i^{(1)} a_{k-i}^{(2)}, \tag{12.87}$$


where C(i,k, N) are combinatorial coefficients that we still need to determine. There is 
an easy way to get these coefficients: we can use a case that we can compute directly and 
match the coefficients. If $A_2 = \mathbf{1}$, the identity matrix, we have

$$p_2(z) = (z-1)^N \;\Rightarrow\; a_k^{(2)} = \binom{N}{k}. \tag{12.88}$$

For a generic polynomial $p_1(z)$, the free sum with $p_2(z)$ is given by a simple shift in the
argument z by 1:

$$p_s(z) = p_1(z-1) = \sum_{k=0}^{N} (-1)^k a_k^{(1)} (z-1)^{N-k} \;\Rightarrow\; a_k^{(s)} = \sum_{i=0}^{k} \binom{N-i}{k-i} a_i^{(1)}. \tag{12.89}$$
Combining Eqs. (12.87) and (12.88) and matching the coefficient to (12.89), we arrive at

$$a_k^{(s)} = \sum_{i=0}^{k} \frac{(N-i)!\,(N-k+i)!}{N!\,(N-k)!}\, a_i^{(1)} a_{k-i}^{(2)}, \tag{12.90}$$

which is equivalent to Eq. (12.64). For the equivalence with Eq. (12.62) see 
Exercise 12.4.2. 
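
The matching argument above can be checked by brute force for small N. The following sketch is an illustration only, under the assumption that the coefficients are the elementary symmetric polynomials of the roots, $p(z) = \sum_k (-1)^k a_k z^{N-k}$. It averages the polynomial with roots $\lambda_i^{(1)} + \lambda_{\sigma(i)}^{(2)}$ over all permutations and compares the result with the closed formula (12.90). The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def elementary_symmetric(roots):
    """Return [a_0, ..., a_N] with prod_i (z - r_i) = sum_k (-1)^k a_k z^(N-k)."""
    e = [Fraction(1)]
    for r in roots:
        e = [
            (e[k] if k < len(e) else 0) + (e[k - 1] * r if k >= 1 else 0)
            for k in range(len(e) + 1)
        ]
    return e


def free_sum_by_permutations(r1, r2):
    """Average the coefficients of prod_i (z - r1[i] - r2[sigma(i)]) over all sigma."""
    n = len(r1)
    total = [Fraction(0)] * (n + 1)
    for sigma in permutations(range(n)):
        coeffs = elementary_symmetric([r1[i] + r2[sigma[i]] for i in range(n)])
        total = [t + c for t, c in zip(total, coeffs)]
    return [t / factorial(n) for t in total]


def free_sum_by_formula(a1, a2):
    """Closed formula: weights (N-i)!(N-k+i)! / (N!(N-k)!) on a_i^(1) a_(k-i)^(2)."""
    n = len(a1) - 1
    out = []
    for k in range(n + 1):
        s = Fraction(0)
        for i in range(k + 1):
            weight = Fraction(
                factorial(n - i) * factorial(n - k + i),
                factorial(n) * factorial(n - k),
            )
            s += weight * a1[i] * a2[k - i]
        out.append(s)
    return out


if __name__ == "__main__":
    r1, r2 = [0, 1, 3], [2, 5, -1]
    print(free_sum_by_permutations(r1, r2))
    print(free_sum_by_formula(elementary_symmetric(r1), elementary_symmetric(r2)))
```

The enumeration costs N! evaluations, so it is only meant as a check for N of order 5 or less.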

Now suppose we want to average Eq. (12.61) with respect to the orthogonal or unitary
group (O(N) or U(N)). For a given rotation matrix O, we can expand the determinant,
keeping track of powers of z and of the various $\lambda_i^{(1)}$ that appear in products containing at
most one of each $\lambda_i^{(1)}$. After averaging over the group, the combinations of $\lambda_i^{(1)}$ must be
permutation invariant, i.e. proportional to $a_i^{(1)}$; we then get for the coefficient $a_k^{(s)}$

$$a_k^{(s)} = \sum_{i=0}^{k} C\left(i,k,N,\{\lambda_j^{(2)}\}\right) a_i^{(1)}, \tag{12.91}$$

where the coefficients $C(i,k,N,\{\lambda_j^{(2)}\})$ depend on the roots $\lambda_j^{(2)}$. By dimensional
analysis, they must be homogeneous to $(\lambda^{(2)})^{k-i}$. Since the expression must be symmetrical
in (1) ↔ (2), it must be of the form (12.87). And since the free addition with the
unit matrix is the same for all three groups, Eq. (12.64) must be true in all three cases
($S_N$, O(N) and U(N)).

We now turn to the proof of Eq. (12.79) for the finite free product. Consider Eq. (12.78), 
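where the matrix O is averaged over the permutation group $S_N$. For $p_m(z) = p_1 \boxtimes p_2(z)$
we have

$$p_m(z) = \frac{1}{N!} \sum_{\text{permutations } \sigma} \prod_{i=1}^{N} \left(z - \lambda_i^{(1)} \lambda_{\sigma(i)}^{(2)}\right) =: \frac{1}{N!} \sum_{\text{permutations } \sigma} p^{\sigma}(z). \tag{12.92}$$

For a given permutation σ, the coefficients $a_k^{\sigma}$ are given by

$$a_k^{\sigma} = \sum_{\substack{\text{ordered} \\ k\text{-tuples } i}} \lambda_{i_1}^{(1)} \lambda_{i_2}^{(1)} \cdots \lambda_{i_k}^{(1)}\, \lambda_{\sigma(i_1)}^{(2)} \lambda_{\sigma(i_2)}^{(2)} \cdots \lambda_{\sigma(i_k)}^{(2)}. \tag{12.93}$$

After averaging over the permutations σ, we must have that $a_k^{(m)} \propto a_k^{(1)} a_k^{(2)}$. By counting
the number of terms in the sum defining each $a_k$, we realize that the proportionality
constant must be one over this number. We then have

$$a_k^{(m)} = \frac{k!\,(N-k)!}{N!}\, a_k^{(1)} a_k^{(2)}. \tag{12.94}$$

We could have derived the proportionality constant by requiring that the polynomial
$p_1(z) = (z-1)^N$ is the neutral element of this convolution.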
Exercise 12.4.2 Equivalence of finite free addition formulas
For a real-rooted monic polynomial $p_1(z)$ of degree N, we define the
polynomial $\hat{p}_1(u)$ as

$$\hat{p}_1(u) = \sum_{k=0}^{N} (-1)^k \frac{(N-k)!}{N!}\, a_k^{(1)} u^k, \tag{12.95}$$

where the coefficients $a_k^{(1)}$ are given by Eq. (12.53).
(a) Show that

$$p_1(z) = \hat{p}_1\!\left(\frac{d}{dz}\right) z^N. \tag{12.96}$$

(b) For another polynomial $p_2(z)$, show that

$$\hat{p}_1\!\left(\frac{d}{dz}\right) \hat{p}_2\!\left(\frac{d}{dz}\right) z^N = \sum_{k=0}^{N} (-1)^k z^{N-k} \sum_{i+j=k} \frac{(N-i)!\,(N-j)!}{N!\,(N-i-j)!}\, a_i^{(1)} a_j^{(2)}, \tag{12.97}$$

which shows the equivalence of Eqs. (12.62) and (12.64).


12.5 Freeness for 2 x 2 Matrices 


Certain low-dimensional random matrices can be free. For N = 1 (1 x 1 matrices) all 
matrices commute and thus behave as classical random numbers. As mentioned in Chapter 
11, freeness is trivial for commuting variables as only constants (deterministic variables) 
can be free with respect to non-constant random variables. 

For N = 2, there exist non-trivial matrices that can be mutually free. To be more precise, 
we consider the space of 2 x 2 symmetric random matrices and define the operator³


$$\tau(A) = \frac{1}{2}\,\mathbb{E}[\mathrm{Tr}\, A]. \tag{12.98}$$


We now consider matrices that have deterministic eigenvalues but random, rotationally 
invariant eigenvectors. We will see that any two such matrices are free. Since 2 x 2 matrices 
only have two eigenvalues we can write these matrices as 


$$A = a\mathbf{1} + \sigma\, O \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} O^T, \tag{12.99}$$


3 More formally, we need to consider 2 x 2 symmetric random matrices with finite moments to all orders. This space is closed 
under addition and multiplication and forms a ring satisfying all the axioms described in Section 11.1. 
where a is the mean of the two eigenvalues and σ their half-difference. The matrix O is a
random rotation matrix which for N = 2 only has one degree of freedom and can always
be written as

$$O = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \tag{12.100}$$

where the angle θ is uniformly distributed in [0, 2π]. Note again that we are considering
matrices for which a and σ are non-random. If we perform the matrix multiplications and
use some standard trigonometric identities we find

$$A = a\mathbf{1} + \sigma \begin{pmatrix} \cos 2\theta & \sin 2\theta \\ \sin 2\theta & -\cos 2\theta \end{pmatrix}. \tag{12.101}$$


12.5.1 Freeness of Matrices with Deterministic Eigenvalues 


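We can now show that any two such matrices A and B are free. If we put ourselves in
the basis where A is diagonal, we see that traceless polynomials $p_k(A)$ and $q_k(B)$ are
necessarily of the form

$$p_k(A) = a_k \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \quad\text{and}\quad q_k(B) = b_k \begin{pmatrix} \cos 2\theta & \sin 2\theta \\ \sin 2\theta & -\cos 2\theta \end{pmatrix}, \tag{12.102}$$

for some deterministic numbers $a_k$ and $b_k$ and where θ is the random angle between the
eigenvectors of A and B. We can now compute the expectation value of the trace of products
of such polynomials:

$$\tau\left(\prod_{k=1}^{n} p_k(A)\, q_k(B)\right) = \frac{1}{2}\left(\prod_{k=1}^{n} a_k b_k\right) \mathbb{E}\left[\mathrm{Tr} \begin{pmatrix} \cos 2\theta & \sin 2\theta \\ -\sin 2\theta & \cos 2\theta \end{pmatrix}^{n}\right]. \tag{12.103}$$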


We notice that the matrix on the right hand side is a rotation matrix (of angle 2θ) raised
to the power n and therefore itself a rotation matrix (of angle 2nθ). The average of such a
matrix is zero, thus finishing our proof that two rotationally invariant 2 x 2 matrices with
deterministic eigenvalues are free.
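
As a numerical illustration of this argument (a sketch, not part of the original derivation), one can average the normalized trace of an alternating product of traceless matrices over a uniform grid of angles. The traceless polynomials are assumed to take the form $a_k\,\mathrm{diag}(1,-1)$ and $b_k$ times the matrix with rows $(\cos 2\theta, \sin 2\theta)$ and $(\sin 2\theta, -\cos 2\theta)$; the helper names are hypothetical.

```python
from math import cos, pi, sin


def matmul(x, y):
    """Product of two 2 x 2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def mixed_moment(a, b, grid=360):
    """Grid average over theta of (1/2) Tr prod_k [a_k D][b_k M(theta)]."""
    total = 0.0
    for step in range(grid):
        t = 2.0 * pi * step / grid
        prod = ((1.0, 0.0), (0.0, 1.0))
        for ak, bk in zip(a, b):
            p = ((ak, 0.0), (0.0, -ak))
            q = (
                (bk * cos(2 * t), bk * sin(2 * t)),
                (bk * sin(2 * t), -bk * cos(2 * t)),
            )
            prod = matmul(matmul(prod, p), q)
        total += 0.5 * (prod[0][0] + prod[1][1])
    return total / grid


if __name__ == "__main__":
    # Alternating product of three pairs of centered polynomials.
    print(mixed_moment([1.0, 2.0, -0.5], [0.3, 1.5, 2.0]))
```

The grid average vanishes up to rounding because the trace is proportional to $\cos 2n\theta$, whose mean over a full period is zero.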

As a consequence, we can use the R-transform to compute the average eigenvalue
spectrum of $A + OBO^T$ when A and B have deterministic eigenvalues and O is a
random rotation matrix. In particular if A and B have the same variance ($\sigma^2$) then this
spectrum is given by the arcsine law that we encountered in Section 7.1.3. For positive
definite A we can also compute the spectrum of $\sqrt{A}\,OBO^T\sqrt{A}$ using the S-transform (see
Exercises 12.5.1 and 12.5.2).


Exercise 12.5.1 Sum of two free 2 x 2 matrices 


(a) Consider A, a traceless rotationally invariant 2 x 2 matrix with deterministic
eigenvalues $\lambda_\pm = \pm\sigma$. Show that

$$g_A(z) = \frac{z}{z^2 - \sigma^2} \quad\text{and}\quad R_A(g) = \frac{\sqrt{1 + 4\sigma^2 g^2} - 1}{2g}. \tag{12.104}$$

(b) Two such matrices $A_1$ and $A_2$ (with eigenvalues $\pm\sigma_1$ and $\pm\sigma_2$ respectively)
are free, so we can sum their R-transforms to find the spectrum of their sum.
Show that $g_{A_1+A_2}(z)$ is given by one of the roots of

$$g_{A_1+A_2}(z) = \frac{\pm z}{\sqrt{(\sigma_1^2 - \sigma_2^2)^2 - 2(\sigma_1^2 + \sigma_2^2) z^2 + z^4}}. \tag{12.105}$$


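(c) In the basis where $A_1$ is diagonal, $A_1 + A_2$ has the form

$$A = \begin{pmatrix} \sigma_1 + \sigma_2 \cos 2\theta & \sigma_2 \sin 2\theta \\ \sigma_2 \sin 2\theta & -\sigma_1 - \sigma_2 \cos 2\theta \end{pmatrix} \tag{12.106}$$

for a random angle θ uniformly distributed between [0, 2π]. Show that the
eigenvalues of $A_1 + A_2$ are given by

$$\lambda_\pm = \pm\sqrt{\sigma_1^2 + \sigma_2^2 + 2\sigma_1\sigma_2 \cos 2\theta}. \tag{12.107}$$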


(d) Show that the densities implied by (b) and (c) are the same. For $\sigma_1 = \sigma_2$, it is
called the arcsine law (see Section 7.1.3).
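
Part (d) can be sanity-checked numerically before attempting the analytical proof. The sketch below is an illustration under stated assumptions: z is taken real and larger than $\sigma_1 + \sigma_2$, the positive root is selected, and the closed form used is $g(z) = z / \sqrt{(\sigma_1^2 - \sigma_2^2)^2 - 2(\sigma_1^2 + \sigma_2^2) z^2 + z^4}$. It is compared with the angular average of $z/(z^2 - \lambda^2)$, where $\lambda^2 = \sigma_1^2 + \sigma_2^2 + 2\sigma_1\sigma_2\cos 2\theta$. The function names are hypothetical.

```python
from math import cos, pi, sqrt


def g_from_eigenvalues(z, s1, s2, grid=2000):
    """Average over theta of (1/2)[1/(z - lam) + 1/(z + lam)] = z / (z^2 - lam^2)."""
    total = 0.0
    for step in range(grid):
        theta = 2.0 * pi * step / grid
        lam_sq = s1 * s1 + s2 * s2 + 2.0 * s1 * s2 * cos(2.0 * theta)
        total += z / (z * z - lam_sq)
    return total / grid


def g_from_r_transform(z, s1, s2):
    """Positive root of the closed form obtained by adding the two R-transforms."""
    disc = (s1 * s1 - s2 * s2) ** 2 - 2.0 * (s1 * s1 + s2 * s2) * z * z + z ** 4
    return z / sqrt(disc)


if __name__ == "__main__":
    print(g_from_eigenvalues(3.0, 1.0, 0.5), g_from_r_transform(3.0, 1.0, 0.5))
```

For $\sigma_1 = \sigma_2 = \sigma$ the closed form reduces to $1/\sqrt{z^2 - 4\sigma^2}$, the Stieltjes transform of the arcsine law.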


Exercise 12.5.2 Product of two free 2 x 2 matrices 
A rotationally invariant 2 x 2 matrix with deterministic eigenvalues 0 and
$a_1 > 0$ has the form

$$A_1 = O \begin{pmatrix} 0 & 0 \\ 0 & a_1 \end{pmatrix} O^T. \tag{12.108}$$


Two such matrices are free so we can use the S-transform to compute the 
eigenvalue distribution of their product. 
(a) Show that the T-transform and the S-transform of $A_1$ are given by

$$t_{A_1}(\zeta) = \frac{a_1}{2(\zeta - a_1)} \quad\text{and}\quad S_{A_1}(t) = \frac{2}{a_1}\,\frac{t+1}{2t+1}. \tag{12.109}$$

(b) Consider another such matrix $A_2$ with independent eigenvectors and non-zero
eigenvalue $a_2$. Using the multiplicativity of the S-transform, show that the
T-transform and the density of eigenvalues of $\sqrt{A_1} A_2 \sqrt{A_1}$ are given by

$$t_{A_1 A_2}(\zeta) = \frac{1}{2}\sqrt{\frac{\zeta}{\zeta - a_1 a_2}} - \frac{1}{2} \tag{12.110}$$

and

$$\rho_{A_1 A_2}(\lambda) = \frac{1}{2}\,\delta(\lambda) + \frac{1}{2\pi\sqrt{\lambda\,(a_1 a_2 - \lambda)}} \quad\text{for } 0 \le \lambda \le a_1 a_2, \tag{12.111}$$

where the delta function in the density indicates the fact that one eigenvalue
is always zero.

(c) By directly computing the matrix product in the basis where $A_1$ is diagonal,
show that, in that basis,

$$\sqrt{A_1}\, A_2 \sqrt{A_1} = a_1 a_2 \begin{pmatrix} 0 & 0 \\ 0 & \cos^2\theta \end{pmatrix}, \tag{12.112}$$

where θ is a random angle uniformly distributed between 0 and 2π.
(d) Show that the distribution of the non-zero eigenvalue implied by (b) and (c) is 
the same. It is the shifted arcsine law. 


12.5.2 Pairwise Freeness and Free Collections 


For the 2 x 2 matrices A and B to be free, they need to have deterministic eigenvalues. 
The proof above does not work if the eigenvalues of one of these matrices are random. As 
an illustration, consider three matrices A, B and C as above (2 x 2 symmetric rotationally 
invariant matrices with deterministic eigenvalues). Each pair of these matrices is free. On 
the other hand, the free sum A + B has random eigenvalues and it is not necessarily free 
from C. Actually we can show that it is not free with respect to C. 

First we show that A, B and C do not form a free collection. For simplicity, we consider 
them traceless with $\sigma_A = \sigma_B = \sigma_C = 1$. Then one can show that


$$\tau(ABCABC) = \tau\left((ABC)^2\right) = 1, \tag{12.113}$$


violating the freeness condition for three variables. Indeed, let us compute explicitly ABC 
in the basis where A is diagonal. We find 


$$ABC = \begin{pmatrix} \cos 2\theta \cos 2\phi + \sin 2\theta \sin 2\phi & \cos 2\theta \sin 2\phi - \sin 2\theta \cos 2\phi \\ \cos 2\theta \sin 2\phi - \sin 2\theta \cos 2\phi & -\cos 2\theta \cos 2\phi - \sin 2\theta \sin 2\phi \end{pmatrix}, \tag{12.114}$$

where φ is the angle between the eigenvectors of A and those of C. The matrix ABC is a
non-zero symmetric matrix, so the trace of its square must be non-zero. Actually one finds
$\tau((ABC)^2) = 1$.
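
The failure of collective freeness can also be seen numerically. The following sketch (an illustration with hypothetical helper names) takes $A = \mathrm{diag}(1,-1)$ and represents B and C by matrices with rows $(\cos 2\theta, \sin 2\theta)$ and $(\sin 2\theta, -\cos 2\theta)$ at independent uniform angles. It averages $\frac{1}{2}\mathrm{Tr}(ABC)$ and $\frac{1}{2}\mathrm{Tr}((ABC)^2)$ over a grid of angle pairs.

```python
from math import cos, pi, sin


def matmul(x, y):
    """Product of two 2 x 2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def traceless_unit(angle):
    """Traceless symmetric matrix with eigenvalues +1 and -1 at the given angle."""
    return ((cos(2 * angle), sin(2 * angle)), (sin(2 * angle), -cos(2 * angle)))


def tau_of_abc_powers(grid=60):
    """Return grid averages of tau(ABC) and tau((ABC)^2) over both angles."""
    a = ((1.0, 0.0), (0.0, -1.0))
    first = 0.0
    second = 0.0
    for i in range(grid):
        b = traceless_unit(2.0 * pi * i / grid)
        for j in range(grid):
            c = traceless_unit(2.0 * pi * j / grid)
            abc = matmul(matmul(a, b), c)
            sq = matmul(abc, abc)
            first += 0.5 * (abc[0][0] + abc[1][1])
            second += 0.5 * (sq[0][0] + sq[1][1])
    return first / grid ** 2, second / grid ** 2


if __name__ == "__main__":
    print(tau_of_abc_powers())
```

If A, B and C formed a free collection, the second average would vanish, since ABCABC is an alternating product of centered variables; instead it equals one for every pair of angles.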

Another way to see that A, B and C are not free as a group is to compute the mixed
cumulant $\kappa_6(A,B,C,A,B,C)$, given that odd cumulants such as $\kappa_1(A)$ and $\kappa_3(A,B,C)$
are zero and that mixed cumulants involving two matrices are zero (they are pairwise
free). The only non-zero term in the moment-cumulant relation for $\tau(ABCABC)$ (see
Eq. (11.74)) is


$$\tau(ABCABC) = \kappa_6(A,B,C,A,B,C) = 1. \tag{12.115}$$
The matrices A, B and C have therefore at least one non-zero cross-cumulant and cannot
be free as a collection.

Now, to show that A + B is not free from C, we realize that if we expand the sixth
cross-cumulant $\kappa_6(A+B, A+B, C, A+B, A+B, C)$, we will encounter the above non-zero
cross-cumulant of A, B and C. Indeed, a cross-cumulant is linear in each of its arguments.
Since all other terms in this expansion are zero, we find

$$\begin{aligned} \kappa_6(A+B, A+B, C, A+B, A+B, C) &= \kappa_6(A,B,C,A,B,C) + \kappa_6(B,A,C,A,B,C) \\ &\quad + \kappa_6(A,B,C,B,A,C) + \kappa_6(B,A,C,B,A,C) = 4 \neq 0. \end{aligned} \tag{12.116}$$


As a consequence, even though 2 x 2 symmetric rotationally invariant random matrices 
with deterministic eigenvalues are pairwise free, they do not satisfy the free CLT. If they
did, it would imply that 2 x 2 matrices with a semi-circle spectrum would be stable under 
addition, which is not the case. Note that Gaussian 2 x 2 Wigner matrices (which are stable 
under addition) do not have a semi-circle spectrum (see Exercise 12.5.3). 


Exercise 12.5.3 Eigenvalue spectrum of 2 x 2 Gaussian matrices 
Real symmetric and complex Hermitian Gaussian 2 x 2 matrices are stable 
under addition but they are not free. In this exercise we see that their spectrum is 
not given by a semi-circle law. 


(a) Use Eq. (5.22) and the Gaussian potential $V(x) = x^2/2$ to write the joint
probability (up to a normalization) of $\lambda_1$ and $\lambda_2$, the two eigenvalues of a real
symmetric Gaussian 2 x 2 matrix.

(b) To find the eigenvalue density, we need to compute

$$\rho(\lambda_1) = \int d\lambda_2\, P(\lambda_1, \lambda_2). \tag{12.117}$$

This integral will involve an error function because of the absolute value in
$P(\lambda_1, \lambda_2)$. If you have the courage, compute $\rho(\lambda)$ leaving the normalization
undetermined.

(c) It is easier to do the complex Hermitian case. Use Eq. (5.26) and the same
potential to adapt your answer in (a) to the β = 2 case.

(d) The absolute value has now disappeared and the integral in (b) is now much
easier. Perform this integral and find the normalization constant. You should
obtain

$$\rho(\lambda) = \frac{\lambda^2 + \frac{1}{2}}{\sqrt{\pi}}\, e^{-\lambda^2}. \tag{12.118}$$
13
The Replica Method* 


In this chapter we will review another important tool to perform compact computations in 
random systems and in particular in random matrix theory, namely the “replica method”. 
For example, one can use replicas to understand the R-transform addition rule when one 
adds two large, randomly rotated matrices. 

Suppose that we want to compute the free energy of a large random system. The free 
energy is the logarithm of some partition function Z.¹ We expect that the free energy does
not depend on the particular sample so we can average the free energy with respect to
the randomness in the system to get the typical free energy of a given sample.
Unfortunately, averaging the logarithm of a partition function is hard. What we can do more
easily is to compute the partition function to some power n and later let n → 0 using the
“replica trick”:

$$\log Z = \lim_{n \to 0} \frac{Z^n - 1}{n}. \tag{13.1}$$


The partition function $Z^n$ is just the partition function of n non-interacting copies of the
same system Z; these copies are called “replicas”, hence the name of the technique.
Averaging the logarithm is then equivalent to averaging $Z^n$ and taking the limit n → 0 as above.
The averaging procedure will however couple the n copies and the resulting interacting
system is in general hard to solve. In many interesting cases, the partition function can only
be computed as the size of the system (say N) goes to infinity. Naturally one is tempted to
interchange the limits (n → 0 and N → ∞) but there is no mathematical justification (yet)
for doing so. Another problem is that we can hope to compute $\mathbb{E}[Z^n]$ for all integers n but
is that really sufficient to do a proper n → 0 limit?
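
For a single positive number Z (no randomness and no large-N limit), the identity behind the replica trick is elementary and can be checked directly. The sketch below is only an illustration of $\log Z = \lim_{n\to 0}(Z^n - 1)/n$; the function name is hypothetical.

```python
from math import log


def replica_estimate(z, n):
    """Finite-n estimate (Z^n - 1) / n of log Z."""
    return (z ** n - 1.0) / n


if __name__ == "__main__":
    for n in (1.0, 0.1, 1e-3, 1e-6):
        print(n, replica_estimate(3.0, n), log(3.0))
```

The leading correction is $n (\log Z)^2 / 2$, so the estimate approaches $\log Z$ linearly in n. The difficulty discussed in the text comes from the average over disorder and from the order of limits, not from this scalar identity.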

For all these reasons, replica computations are not considered rigorous. Nevertheless, 


they are a precious source of intuition and they allow one to obtain results mathematicians 
would call conjectures, but that often turn out to be mathematically exact. Although a lot 
of progress has been made to understand why the replica trick works, there is still a halo of 
mystery and magic surrounding the method, and a nagging impression that an equivalent 
but more transparent formalism awaits revelation. 


1 See e.g. Section 13.4 for an explicit example.
In the present chapter, we will show how all the results obtained up to now can be
rederived using replicas. We start by showing how the Stieltjes transform can be expressed 
using the replica method, and obtain once more the semi-circle law for Gaussian random 
matrices. We then discuss R-transforms and S-transforms in the language of replicas. 


13.1 Stieltjes Transform 


As we have shown in Chapter 2, the density of eigenvalues ρ(λ) of a random matrix is
encoded in the trace of the resolvent of that matrix, which defines the Stieltjes transform of
ρ(λ). Here we show how this quantity can be computed using the replica formalism.


13.1.1 General Set-Up 


To use the replica trick in random matrix theory, we first need to express the Stieltjes 
transform as the average logarithm of a (possibly random) determinant. In the large N 
limit and for z sufficiently far from the real eigenvalues, the discrete Stieltjes transform
$g_N(z)$ converges to a deterministic function $g(z)$. The replica trick will allow us to compute
$\mathbb{E}[g_N(z)]$, which also converges to $g(z)$.

Using the definition Eq. (2.19) and dropping the N subscript, we have 


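$$\mathbb{E}[g_A(z)] = \frac{1}{N}\,\mathbb{E}\left[\sum_{k=1}^{N} \frac{1}{z - \lambda_k}\right], \tag{13.2}$$

whereas the determinant of $z\mathbf{1} - A$ is given by

$$\det(z\mathbf{1} - A) = \prod_{k=1}^{N} (z - \lambda_k). \tag{13.3}$$

We can turn the product in the determinant into a sum by taking the logarithm and obtain
$(z - \lambda_k)^{-1}$ from $\log(z - \lambda_k)$ by taking a derivative with respect to z. We then get

$$\mathbb{E}[g_A(z)] = \frac{1}{N}\,\mathbb{E}\left[\frac{d}{dz} \log\det(z\mathbf{1} - A)\right]. \tag{13.4}$$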

To compute the determinant we may use the multivariate Gaussian identity 


$$\int \frac{d\psi}{(2\pi)^{N/2}} \exp\left(-\frac{\psi^T M \psi}{2}\right) = \frac{1}{\sqrt{\det M}}, \tag{13.5}$$

which is exact for any N as long as the matrix M is positive definite. For z larger than the
top eigenvalue of A, $(z\mathbf{1} - A)$ will be positive definite. The Gaussian formula allows us to
compute the inverse of the square-root of the determinant, but we can neutralize the power
−1/2 by introducing an extra factor −2 in front of the logarithm. Applying the replica trick
(13.1) we thus get
$$\mathbb{E}[g_A(z)] = -2\,\mathbb{E}\left[\frac{d}{dz} \lim_{n \to 0} \frac{Z^n - 1}{Nn}\right], \tag{13.6}$$

with

$$Z^n = \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}} \exp\left(-\sum_{\alpha=1}^{n} \frac{\psi_\alpha^T (z\mathbf{1} - A)\, \psi_\alpha}{2}\right), \tag{13.7}$$

where we have written $Z^n$ as the product of n copies of the same Gaussian integral. This
is all fine, except our $Z^n$ is only defined for integer n and we need to take n → 0. The
limiting Stieltjes transform is defined as $g_A(z) = \lim_{N\to\infty} \mathbb{E}[g_N^A(z)]$, but in practice the
replica trick will allow us to compute

$$\tilde{g}_A(z) = -2\,\frac{d}{dz} \lim_{n \to 0} \lim_{N \to \infty} \frac{\mathbb{E}[Z^n] - 1}{Nn}, \tag{13.8}$$

and hope that the two limits n → 0 and N → ∞ commute, such that $\tilde{g}_A(z) = g_A(z)$. There
is, however, no guarantee that these limits do commute.


13.1.2 The Wigner Case 


As a detailed example of replica trick calculation, we now give all the steps necessary to 
compute the Stieltjes transform for the Wigner ensemble. We want to take the expectation 
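value of Eq. (13.7) in the case where A = X, a symmetric Gaussian rotationally invariant
matrix:

$$\begin{aligned} \mathbb{E}[Z^n] &= \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}}\, \mathbb{E}\left[\exp\left(-\sum_{\alpha=1}^{n}\left[\frac{z}{2}\sum_{i=1}^{N} \psi_{\alpha i}^2 - \sum_{i<j}^{N} X_{ij}\psi_{\alpha i}\psi_{\alpha j} - \frac{1}{2}\sum_{i}^{N} X_{ii}\psi_{\alpha i}^2\right]\right)\right] \\ &= \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}} \exp\left(-\frac{z}{2}\sum_{\alpha=1}^{n}\sum_{i=1}^{N} \psi_{\alpha i}^2\right) \prod_{i<j}^{N} \mathbb{E}\left[\exp\left(X_{ij}\sum_{\alpha=1}^{n} \psi_{\alpha i}\psi_{\alpha j}\right)\right] \\ &\quad \times \prod_{i}^{N} \mathbb{E}\left[\exp\left(\frac{1}{2} X_{ii} \sum_{\alpha=1}^{n} \psi_{\alpha i}^2\right)\right], \end{aligned} \tag{13.9}$$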


where we have isolated the products of expectation of independent terms and separated the
diagonal and off-diagonal terms. We can evaluate the expectation values using the following
identity: for a centered Gaussian variable x of variance $\sigma^2$, we have

$$\mathbb{E}[e^{x}] = e^{\sigma^2/2}. \tag{13.10}$$


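Using the fact that the diagonal and off-diagonal elements have a variance equal to,
respectively, $2\sigma^2/N$ and $\sigma^2/N$ (see Section 2.2.2), we get

$$\begin{aligned} \mathbb{E}[Z^n] &= \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}} \exp\left(-\frac{z}{2}\sum_{\alpha=1}^{n}\sum_{i=1}^{N} \psi_{\alpha i}^2\right) \prod_{i<j}^{N} \exp\left[\frac{\sigma^2}{2N}\left(\sum_{\alpha=1}^{n} \psi_{\alpha i}\psi_{\alpha j}\right)^2\right] \\ &\quad \times \prod_{i}^{N} \exp\left[\frac{\sigma^2}{4N}\left(\sum_{\alpha=1}^{n} \psi_{\alpha i}^2\right)^2\right]. \end{aligned} \tag{13.11}$$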

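We can now combine the last two sums in the exponential into a single sum over {ij}, which
we can further transform into

$$\frac{\sigma^2}{4N} \sum_{i,j=1}^{N} \left(\sum_{\alpha=1}^{n} \psi_{\alpha i}\psi_{\alpha j}\right)^2 = \frac{\sigma^2 N}{4} \sum_{\alpha,\beta=1}^{n} \left(\frac{\sum_{i=1}^{N} \psi_{\alpha i}\psi_{\beta i}}{N}\right)^2. \tag{13.12}$$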


We would like to integrate over the variables $\psi_{\alpha i}$ but the argument of the exponential
contains fourth order terms in the ψ’s. To tame this term, one uses the
Hubbard–Stratonovich identity:

$$\exp\left(\frac{a x^2}{2}\right) = \int \frac{dq}{\sqrt{2\pi a}} \exp\left(-\frac{q^2}{2a} + xq\right). \tag{13.13}$$

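Before we use Hubbard–Stratonovich, we need to regroup diagonal and off-diagonal terms
in αβ:

$$\frac{\sigma^2 N}{4} \sum_{\alpha,\beta=1}^{n} \left(\frac{\sum_{i} \psi_{\alpha i}\psi_{\beta i}}{N}\right)^2 = \frac{\sigma^2 N}{2} \sum_{\alpha<\beta}^{n} \left(\frac{\sum_{i} \psi_{\alpha i}\psi_{\beta i}}{N}\right)^2 + \sigma^2 N \sum_{\alpha=1}^{n} \left(\frac{\sum_{i} \psi_{\alpha i}^2}{2N}\right)^2, \tag{13.14}$$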


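where in the diagonal terms we have pushed the factor 1/4 in the squared quantity for later
convenience. We can now use Eq. (13.13), introducing diagonal $q_{\alpha\alpha}$ and upper triangular
$q_{\alpha\beta}$ to linearize the squared quantities. Writing the q’s as a symmetric matrix q we have

$$\mathbb{E}[Z^n] \propto \int dq\, \exp\left(-\frac{N\,\mathrm{Tr}\, q^2}{4\sigma^2}\right) \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}} \exp\left(-\sum_{i=1}^{N} \sum_{\alpha,\beta=1}^{n} \frac{\psi_{\alpha i}\,(z\delta_{\alpha\beta} - q_{\alpha\beta})\,\psi_{\beta i}}{2}\right), \tag{13.15}$$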


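where dq is the integration over the independent components of the n x n symmetric
matrix q; note that we have dropped z-independent constant factors. The integral over the $\psi_{\alpha i}$
is now a multivariate Gaussian integral, actually N copies of the very same n-dimensional
Gaussian integral:

$$\int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{\sqrt{2\pi}} \exp\left(-\sum_{\alpha,\beta=1}^{n} \frac{\psi_\alpha (z\delta_{\alpha\beta} - q_{\alpha\beta}) \psi_\beta}{2}\right) = \left(\det(z\mathbf{1} - q)\right)^{-1/2}. \tag{13.16}$$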
Raising this integral to the Nth power and using det M = exp Tr log M we find

$$\mathbb{E}[Z^n] \propto \int dq\, \exp\left(-\frac{N}{2} F(q)\right) \quad\text{with}\quad F(q) = \frac{\mathrm{Tr}\, q^2}{2\sigma^2} + \mathrm{Tr}\log(z\mathbf{1} - q). \tag{13.17}$$


We now fix n and evaluate $\mathbb{E}[Z^n]$ for very large N, leaving the limit n → 0 for later. In
the large N limit, the integral over the matrix q can be done by the saddle point method.
More precisely, we should find an extremum of F(q) in the n(n + 1)/2 elements of q.
Alternatively we can diagonalize q, introducing the log of a Vandermonde determinant in
the exponential (see Section 5.1.4).² In terms of the eigenvalues $q_\alpha$ of q, one has

$$F(\{q_\alpha\}) = \sum_{\alpha=1}^{n} \left[\frac{q_\alpha^2}{2\sigma^2} + \log(z - q_\alpha)\right] - \frac{1}{N} \sum_{\alpha \neq \beta} \log|q_\alpha - q_\beta|. \tag{13.18}$$


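To find the saddle point, we take the partial derivatives of $F(\{q_\alpha\})$ with respect to the $\{q_\alpha\}$
and equate them to zero:

$$\frac{q_\alpha}{\sigma^2} - \frac{1}{z - q_\alpha} - \frac{2}{N} \sum_{\beta \neq \alpha} \frac{1}{q_\alpha - q_\beta} = 0. \tag{13.19}$$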


The effect of the last term is to push the eigenvalues $q_\alpha$ away from one another, such that
in equilibrium their relative distance is of order 1/N and the last term is of the same order
as the first two. Since there are only n such eigenvalues, the total spread (from the largest
to smallest) is of order n/N, which we will neglect when N → ∞. Hence we can assume
that all eigenvalues are identical and equal to $q^*(z)$, where $q^*(z)$ satisfies

$$z - q^* = \frac{\sigma^2}{q^*}. \tag{13.20}$$

We recognize the self-consistent equation for the Stieltjes transform of the Wigner ensemble (Eq.
(2.35)) if we make the identification $q^*(z) = \sigma^2 g_X(z)$. For N large and n small we indeed
have

$$\mathbb{E}[Z^n] = \exp\left(-\frac{Nn}{2} F_1(z, q^*(z))\right) \quad\text{with}\quad F_1(z,q) = \frac{q^2}{2\sigma^2} + \log(z - q), \tag{13.21}$$

so

$$\lim_{n \to 0} \lim_{N \to \infty} \frac{\mathbb{E}[Z^n] - 1}{Nn} = -\frac{F_1(z, q^*(z))}{2}. \tag{13.22}$$

Finally, from Eq. (13.8), we should have

$$g_X(z) = \frac{d}{dz} F_1(z, q^*(z)). \tag{13.23}$$

2 A third method is used in spin-glass problems where the integrand has permutation symmetry but not necessarily rotational 
symmetry, see Section 13.4. 
To finish the computation we need to take the derivative of $F_1(z, q^*(z))$ with respect to z,
but since $q^*(z)$ is an extremum of $F_1$, the partial derivative of $F_1(z,q)$ with respect to q is
zero at $q = q^*(z)$. Hence

$$g_X(z) = \left.\frac{\partial}{\partial z} F_1(z,q)\right|_{q = q^*(z)} = \frac{1}{z - q^*(z)} = \frac{q^*(z)}{\sigma^2}. \tag{13.24}$$


We thus recover the usual solution of the self-consistent Wigner equation. 
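
The last two steps can be checked numerically. The sketch below is an illustration with hypothetical function names. It assumes that the saddle point solves $z - q = \sigma^2/q$ on the branch that decays as $\sigma^2/z$ at large z, and that $F_1(z,q) = q^2/(2\sigma^2) + \log(z - q)$. It compares a finite-difference derivative of $F_1(z, q^*(z))$ with $q^*(z)/\sigma^2$ for real z outside the spectrum.

```python
from math import log, sqrt


def q_star(z, sigma):
    """Root of q^2 - z q + sigma^2 = 0 that behaves as sigma^2 / z for large z."""
    return 0.5 * (z - sqrt(z * z - 4.0 * sigma * sigma))


def f1(z, q, sigma):
    """Saddle-point function F_1(z, q) = q^2 / (2 sigma^2) + log(z - q)."""
    return q * q / (2.0 * sigma * sigma) + log(z - q)


def g_from_saddle(z, sigma, h=1e-5):
    """Central finite difference of z -> F_1(z, q*(z))."""
    upper = f1(z + h, q_star(z + h, sigma), sigma)
    lower = f1(z - h, q_star(z - h, sigma), sigma)
    return (upper - lower) / (2.0 * h)


if __name__ == "__main__":
    z, sigma = 3.0, 1.0
    print(g_from_saddle(z, sigma), q_star(z, sigma) / sigma ** 2)
```

The total derivative and the partial derivative agree because $q^*$ is stationary, which is exactly the simplification used above.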


13.2 Resolvent Matrix 
13.2.1 General Case 


We saw that the replica trick can be used to compute the average Stieltjes transform of 
a random matrix. The Stieltjes transform is the normalized trace of the resolvent matrix 
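$G_A(z) = (z\mathbf{1} - A)^{-1}$. In Chapter 19 we will need to know the average of the elements
of the resolvent matrix for free addition and multiplication. These averages can also be
computed using the replica trick. An element of an inverse matrix can indeed be written as
a multivariate Gaussian integral:

$$[M^{-1}]_{ij} = \frac{1}{Z} \int \frac{d\psi}{(2\pi)^{N/2}}\, \psi_i \psi_j \exp\left(-\frac{\psi^T M \psi}{2}\right), \qquad Z = \int \frac{d\psi}{(2\pi)^{N/2}} \exp\left(-\frac{\psi^T M \psi}{2}\right), \tag{13.25}$$

which we can rewrite as

$$[M^{-1}]_{ij} = \lim_{m \to -1} Z^m \int \frac{d\psi}{(2\pi)^{N/2}}\, \psi_i \psi_j \exp\left(-\frac{\psi^T M \psi}{2}\right). \tag{13.26}$$

If we express $Z^m$ for $m \in \mathbb{N}^+$ as m Gaussian integrals and combine them with the integral
with the $\psi_i \psi_j$ term (which we label number 1) we get, with n = m + 1:

$$[M^{-1}]_{ij} = \lim_{n \to 0} \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}}\, \psi_{1i} \psi_{1j} \exp\left(-\frac{1}{2}\sum_{\alpha=1}^{n} \psi_\alpha^T M \psi_\alpha\right) := \langle \psi_{1i} \psi_{1j} \rangle_{n=0}. \tag{13.27}$$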
This equation can then be used to compute averages of elements of the resolvent matrix
by using $M = z\mathbf{1} - A$ for the relevant random matrix A. For example, in the case of
Gaussian Wigner matrices, the correlation $\langle \psi_{1i} \psi_{1j} \rangle$ can be computed using the saddle
point configuration of the ψ’s. Since different i all decouple and play the same role, it is
clear that

$$\mathbb{E}[G_X(z)]_{ij} = \langle \psi_{1i} \psi_{1j} \rangle_{n=0} = \delta_{ij}\, g_X(z), \tag{13.28}$$


i.e. the average resolvent of a Gaussian matrix is simply the average Stieltjes transform 
times the identity matrix. 
13.2.2 Free Addition


In this section we will show how to use Eq. (13.27) to compute the average of the full 
resolvent for the sum of two randomly rotated matrices. Since we know these matrices are 
free in the large N limit, we expect to recover the additivity of R-transforms, but we will in 
fact obtain a slightly richer result. Also, the replica method is very convenient to manipulate 
and resum the perturbation theory alluded to in Section 12.2. 

Consider two symmetric matrices A and B and the new matrix $C = A + OBO^T$ where O
is a random orthogonal matrix. We want to compute

$$\mathbb{E}[G_C(z)] = \mathbb{E}\left[\left(z\mathbf{1} - A - OBO^T\right)^{-1}\right], \tag{13.29}$$


where the expectation value is over the orthogonal matrix O. We can always choose B to 
be diagonal. If B is not diagonal to start with, we just absorb the orthogonal matrix that 
diagonalizes B in the matrix O. Expressing Gc(z) in the eigenbasis of A is equivalent 
to choosing A to be diagonal. In that basis, the off-diagonal elements of $\mathbb{E}[G_C(z)]$ must
be zero by the following argument: since both A and B are diagonal, for every matrix O
that contributes to an off-diagonal element of $\mathbb{E}[G_C(z)]$ there exists an equally probable
matrix O′ with the same contribution but opposite sign, hence the average must be zero.
Note that while the average matrix $\mathbb{E}[G_C(z)]$ commutes with A, a particular realization of
the random matrix $G_C(z)$ (corresponding to a specific choice for O) will not in general
commute with A.
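Now, let us use the replica formalism to compute $\mathbb{E}[G_C(z)]$, i.e. start with

$$\begin{aligned} \mathbb{E}[G_C(z)]_{ij} &= \lim_{n \to 0} \int \prod_{\alpha=1}^{n} \frac{d\psi_\alpha}{(2\pi)^{N/2}}\, \psi_{1i}\psi_{1j} \exp\left(-\sum_{\alpha=1}^{n} \frac{\psi_\alpha^T (z\mathbf{1} - A)\,\psi_\alpha}{2}\right) \\ &\quad \times \mathbb{E}_O\left[\exp\left(\sum_{\alpha=1}^{n} \frac{\psi_\alpha^T\, OBO^T \psi_\alpha}{2}\right)\right], \end{aligned} \tag{13.30}$$

where now we skip the i, j indices on $\psi_\alpha$, treated as vectors.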
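The last term with the expectation value can be rewritten as

$$I = \mathbb{E}_O\left[\exp\left(\frac{1}{2}\,\mathrm{Tr}\left[\left(\sum_{\alpha=1}^{n} \psi_\alpha \psi_\alpha^T\right) OBO^T\right]\right)\right]. \tag{13.31}$$

We introduce $Q_{\alpha\beta} = \psi_\alpha^T \psi_\beta / N$, an n x n symmetric matrix whose eigenvalues coincide with
the non-zero eigenvalues of $\sum_\alpha \psi_\alpha \psi_\alpha^T / N$. We recognize the Harish-Chandra–Itzykson–Zuber
integral discussed in Chapter 10, with one matrix ($\sum_\alpha \psi_\alpha \psi_\alpha^T$) being at
most of rank n ≪ N, so we can use the low-rank formula Eq. (10.45). Our expectation
value thus becomes

$$I = \exp\left(\frac{N}{2}\,\mathrm{Tr}_n H_B(Q)\right), \tag{13.32}$$

where $\mathrm{Tr}_n$ denotes the trace of an n x n matrix and $H_B$ is the anti-derivative of the
R-transform of B.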
We now need to perform the integral over $\psi_\alpha$ in Eq. (13.30). But in order to do so we must
deal with $\mathrm{Tr}\, H_B(Q)$, which is a non-linear function of the $\psi_\alpha$. The trick is to make the
matrix Q an integration variable that we fix to its definition using a delta function. The
delta function is itself represented as an integral over another (n x n) symmetric matrix Y
along the imaginary axis. In other words, we introduce the following representation of the
delta function in Eq. (13.30):

$$1 = \int dQ \int_{-i\infty}^{i\infty} \frac{N^{n(n+1)/2}\, dY}{2^{3n/2}\, \pi^{n/2}} \exp\left(-\frac{N}{2}\,\mathrm{Tr}_n QY + \frac{1}{2}\sum_{\alpha,\beta=1}^{n} Y_{\alpha\beta}\, \psi_\alpha^T \psi_\beta\right), \tag{13.33}$$

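where the integrals over dQ and dY are over symmetric matrices. We have absorbed a factor
of N in Y and a factor of 2 on its diagonal, hence the extra factors of 2 and N in front of
dY. We can now perform the following Gaussian integral over $\psi_\alpha$:

$$J_{ij} = \int \prod_{k=1}^{N} \prod_{\alpha=1}^{n} \frac{d\psi_{\alpha k}}{\sqrt{2\pi}}\, \psi_{1i}\psi_{1j} \exp\left(-\frac{1}{2}\sum_{k=1}^{N} \sum_{\alpha,\beta=1}^{n} \psi_{\alpha k}\left(z\delta_{\alpha\beta} - a_k \delta_{\alpha\beta} - Y_{\alpha\beta}\right)\psi_{\beta k}\right), \tag{13.34}$$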

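where we have written the vectors $\psi_\alpha$ in terms of their components $\psi_{\alpha k}$, and where $a_k$ are
the eigenvalues of A. We notice that the Gaussian integral is diagonal in the index k, so we
have N − 1 n-dimensional Gaussian integrands differing only by their value of $a_k$, and a
last integral for k = i = j, where the term $\psi_{1i}^2$ is in front of the Gaussian integrand (the
integral is zero if i ≠ j, meaning that $G_C(z)$ is diagonal, as expected). The result is then

$$J_{ij} = \delta_{ij} \left[\left((z - a_i)\mathbf{1}_n - Y\right)^{-1}\right]_{11} \prod_{k=1}^{N} \det\left((z - a_k)\mathbf{1}_n - Y\right)^{-1/2}, \tag{13.35}$$

where the first term is the 11 element of an n x n matrix, coming from the term $\psi_{1i}^2$.
Returning to our main expression Eq. (13.30) and dropping constants that are 1 as
n → 0,

$$\begin{aligned} \mathbb{E}[G_C(z)]_{ij} &= \lim_{n \to 0} \int dQ \int dY\, \delta_{ij} \left[\left((z - a_i)\mathbf{1}_n - Y\right)^{-1}\right]_{11} \\ &\quad \times \exp\left\{\frac{N}{2}\left(-\mathrm{Tr}_n QY + \mathrm{Tr}_n H_B(Q) - \frac{1}{N}\sum_{k=1}^{N} \mathrm{Tr}_n \log\left((z - a_k)\mathbf{1}_n - Y\right)\right)\right\}. \end{aligned} \tag{13.36}$$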
For large N the integral over Y and Q is dominated by the saddle point, i.e. the extremum 
of the argument of the exponential. The inverse-matrix term in front of the exponential does 
not contain a power of N so it does not contribute to the determination of the saddle point. 
The extremum is over a function of two n x n symmetric matrices. Taking derivatives with 

respect to $Q_{\alpha\beta}$ and equating it to zero gives


$$Y_{\alpha\beta} = [R_B(Q)]_{\alpha\beta}, \tag{13.37}$$
and similarly when taking derivatives with respect to $Y_{\alpha\beta}$:

$$Q_{\alpha\beta} = \frac{1}{N}\sum_{k=1}^{N} \left[\left((z - a_k)\mathbf{1}_n - Y\right)^{-1}\right]_{\alpha\beta}. \tag{13.38}$$

Let us look at these equations in a basis where Q is diagonal. The first equation shows that
Y is also diagonal, so that the second equation reads

$$Q_{\alpha\alpha} = \frac{1}{N}\sum_{k=1}^{N} \frac{1}{z - Y_{\alpha\alpha} - a_k} = g_A(z - Y_{\alpha\alpha}). \tag{13.39}$$
Hence all n diagonal elements $Q_{\alpha\alpha}$ and $Y_{\alpha\alpha}$, α = 1, ..., n satisfy the same pair of
equations. For large z, there is a unique solution to these equations, hence Q and Y must
be multiples of the identity $Q = q^* \mathbf{1}_n$ and $Y = y^* \mathbf{1}_n$, as expected from the rotational
symmetry of the argument of the exponential that we are maximizing. The quantities $q^*$
and $y^*$ are the unique solutions of

$$y = R_B(q) \quad\text{and}\quad q = g_A(z - y). \tag{13.40}$$


The saddle point for Y is real while our integral representation Eq. (13.33) was over purely
imaginary matrices; but for large values of z, the solutions of Eqs. (13.40) give small values
for $q^*$ and $y^*$, and for such small values the integral contour can be deformed without
encountering any singularities. We also justify the use of $R_B(q^*) = H_B'(q^*)$ as $q^*$ can be
made arbitrarily small by choosing a large enough z.

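The expectation of the resolvent is thus given, for large enough z and N, by

$$\mathbb{E}[G_C(z)_{ij}] \approx \lim_{n \to 0} \frac{\delta_{ij}}{z - a_i - y^*} \exp\left\{\frac{Nn}{2}\left(-q^* y^* + H_B(q^*) - \frac{1}{N}\sum_{k=1}^{N} \log(z - a_k - y^*)\right)\right\}. \tag{13.41}$$

As n → 0 the exponential drops out and we obtain, in matrix form,

$$\mathbb{E}[G_C(z)] = G_A\left(z - R_B(q^*)\right) \quad\text{with}\quad q^* = g_A\left(z - R_B(q^*)\right). \tag{13.42}$$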
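
For real z well above the spectrum, the self-consistent equation $q^* = g_A(z - R_B(q^*))$ can be solved by simple fixed-point iteration. The sketch below is an illustration under stated assumptions: A is given by a finite list of eigenvalues, the R-transform of B is passed as a function, and as an example we take a Wigner matrix B, for which $R_B(g) = \sigma^2 g$. The function names and the iteration count are hypothetical choices, not a prescription from the text.

```python
from math import sqrt


def stieltjes(eigenvalues, z):
    """Discrete Stieltjes transform g_A(z) = (1/N) sum_k 1 / (z - a_k)."""
    return sum(1.0 / (z - a) for a in eigenvalues) / len(eigenvalues)


def solve_q(eigs_a, r_b, z, iterations=500):
    """Iterate q <- g_A(z - R_B(q)), starting from the large-z guess q = 1/z."""
    q = 1.0 / z
    for _ in range(iterations):
        q = stieltjes(eigs_a, z - r_b(q))
    return q


def mean_resolvent_diagonal(eigs_a, r_b, z):
    """Diagonal of E[G_C(z)] in the eigenbasis of A: 1 / (z - a_i - R_B(q*))."""
    shift = r_b(solve_q(eigs_a, r_b, z))
    return [1.0 / (z - a - shift) for a in eigs_a]


if __name__ == "__main__":
    def wigner_r(g):
        return 0.5 * g  # R-transform of a Wigner matrix with sigma^2 = 0.5

    print(solve_q([-1.0, 1.0], wigner_r, 4.0))
    print(mean_resolvent_diagonal([-1.0, 1.0], wigner_r, 4.0))
```

The normalized trace of the returned diagonal reproduces $q^*$ itself. When A = 0 the equation reduces to the Wigner self-consistent equation $q = 1/(z - \sigma^2 q)$.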


13.2.3 Resolvent Subordination for Addition and Multiplication 


Equation (13.42) relates the average resolvent of C to that of A. By taking the normalized 
trace on both sides we find 


$$q^* = g_C(z) = g_A\left(z - R_B(q^*)\right), \tag{13.43}$$


which is precisely the subordination relation for the Stieltjes transform of a free sum that 
we found in Section 11.3.8. We just have rederived this result once again with replicas. 
But what is more interesting is that we have found a relationship for the average of a matrix
element of the full resolvent matrix, namely

$$\mathbb{E}[G_C(z)] = G_A\left(z - R_B(g_C(z))\right). \tag{13.44}$$

This relation will give precious information on the overlap between the eigenvectors of A
and B and those of C. Note that, by symmetry, one also has

$$\mathbb{E}[G_C(z)] = G_B\left(z - R_A(g_C(z))\right). \tag{13.45}$$


In the free product case, namely $C = A^{1/2} B A^{1/2}$ where A and B are large positive definite
matrices whose eigenvectors are mutually random, a very similar replica computation gives
a subordination relation for the average T matrix:

$$\mathbb{E}[T_C(\zeta)] = T_A\left[S_B(t_C(\zeta))\,\zeta\right], \tag{13.46}$$

with $S_B(t)$ the S-transform of the matrix B. If we take the normalized trace on both sides,
we recover the subordination relation Eq. (11.109). Equation (13.46) can then be turned
into a subordination relation for the full resolvent:

$$\mathbb{E}[G_C(z)] = S^* G_A(z S^*) \quad\text{with}\quad S^* := S_B\left(z g_C(z) - 1\right). \tag{13.47}$$


13.2.4 “Quenched” vs “Annealed” 


The replica trick is quite burdensome as one has to keep track of n copies of an integration
vector $\psi_\alpha$, and these vectors interact through the averaging process. At large N one typically
has to do a saddle point over one or several n x n matrices (e.g. Q and Y in the free
addition computation of the previous section), and at the end take the n → 0 limit. But in
all computations of Stieltjes transforms so far, taking n = 1 instead of n → 0 gives the
correct saddle point and the correct final result. In other words, assuming that

$$\mathbb{E}[\log Z] \approx \log \mathbb{E}[Z] \tag{13.48}$$


leads to the correct result. For historical reasons coming from physics, E[log Z] is called a 
quenched average whereas log E[Z] is called an annealed average. 
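
To see how quenched and annealed averages can differ when fluctuations are not suppressed, consider a toy example that is not taken from the text: $Z = e^{x}$ with x Gaussian of mean μ and variance $s^2$, so that $\mathbb{E}[Z^n] = \exp(n\mu + n^2 s^2/2)$ exactly. The sketch below, with hypothetical function names, compares the replica estimate of $\mathbb{E}[\log Z]$ with $\log \mathbb{E}[Z]$.

```python
from math import exp, log


def moment(n, mu, s):
    """E[Z^n] for Z = exp(x), x Gaussian with mean mu and variance s^2."""
    return exp(n * mu + 0.5 * n * n * s * s)


def quenched_via_replicas(mu, s, n=1e-6):
    """Small-n estimate (E[Z^n] - 1) / n of the quenched average E[log Z]."""
    return (moment(n, mu, s) - 1.0) / n


def annealed(mu, s):
    """Annealed average log E[Z], i.e. the n = 1 computation."""
    return log(moment(1.0, mu, s))


if __name__ == "__main__":
    print(quenched_via_replicas(0.3, 2.0), annealed(0.3, 2.0))
```

Here the quenched average is μ while the annealed one is $\mu + s^2/2$. The two agree only when the fluctuations of log Z are negligible, which is the situation described in the text for bulk properties of random matrices.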

For example, if we go back to Eq. (13.21) we see that taking the logarithm of the n = 1
result gives the same result as the correct n → 0 limit. The same is true for the Wishart
case. For the free addition and multiplication one can also compute the Stieltjes transform
using n = 1. This is a general result for bulk properties of random matrices. Most natural
ensembles of random symmetric matrices (such as those from Chapter 5 and those arising
from free addition and multiplication) feature a strong repulsion of eigenvalues. Because
of this repulsion, eigenvalues do not fluctuate much around their classical positions; see
the detailed discussion in Section 5.4.1. It is the absence of eigenvalue fluctuations on the
global scale that makes the n = 1 and n → 0 saddle points equivalent.

For the rank-1 HCIZ integral, on the other hand, things are more subtle. As we show
in the next section, the annealed average n = 1 gives the right answer in some interval
of parameters, when the integral is dominated by the bulk properties of eigenvalues. Outside
this regime, fluctuations of the largest eigenvalue matter and the n = 1 result is no
longer correct.


13.3 Rank-1 HCIZ and Replicas


In Chapter 10 we studied the rank-1 HCIZ integral and defined the function $H_B(t)$ as

$$H_B(t) := \lim_{N\to\infty} \frac{2}{N} \log \left\langle \exp\left(\frac{N}{2} \operatorname{Tr}\, T O B O^T\right) \right\rangle_O, \tag{13.49}$$

where the averaging is done over the orthogonal group for O, T is a rank-1 matrix with
eigenvalue t and B a fixed matrix. If B is a member of a random ensemble, such as the
Wigner ensemble, the averaging over O should be done for a fixed B and only later the
function $H_B(t)$ can be averaged, if needed, over the randomness of B (quenched average).
One could also do an annealed average over B, defining another function $\hat{H}(t)$ as


$$\hat{H}(t) := \lim_{N\to\infty} \frac{2}{N} \log \left\langle \exp\left(\frac{N}{2} \operatorname{Tr}\, T O B O^T\right) \right\rangle_{O,B}. \tag{13.50}$$

It turns out that for small enough values of t, the two quantities are equal, i.e. $\hat{H}(t) = H_B(t)$.
For larger values of t, however, these two quantities differ. The aim of this section is
to compute explicitly these quantities in the Wigner case using the replica trick, and show
that there is a phase transition for a well-defined value $t = t_c$ beyond which quenched and
annealed averages do not coincide.


13.3.1 Annealed Average 


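Let us compute directly the “annealed” average when B = X is a Wigner matrix and
$T = t\, e_1 e_1^T$, where t is the only non-zero eigenvalue of T and $e_1$ is the unit vector $(1, 0, \dots, 0)^T$.
Then

$$\begin{aligned}
\left\langle \exp\left(\frac{N}{2} \operatorname{Tr}\, T X\right) \right\rangle_X
&= \left\langle \exp\left(\frac{N t}{2}\, e_1^T X e_1\right) \right\rangle_X \\
&= \int \frac{dX_{11}}{\sqrt{4\pi\sigma^2/N}} \exp\left(\frac{N t X_{11}}{2} - \frac{N X_{11}^2}{4\sigma^2}\right) \\
&= \exp\left(\frac{N \sigma^2 t^2}{4}\right),
\end{aligned} \tag{13.51}$$

so the annealed $\hat{H}_{\mathrm{wig}}(t)$ is given by

$$\hat{H}_{\mathrm{wig}}(t) = \frac{\sigma^2 t^2}{2}, \tag{13.52}$$

which, at least superficially, coincides with the integral of the R-transform of a Wigner
matrix, Eq. (10.61).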
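
The Gaussian integral in Eq. (13.51) can be checked numerically. The sketch below evaluates it with a plain Riemann sum for a finite N and returns (2/N) times its logarithm, which should reproduce σ²t²/2. The function name `annealed_H` and the grid parameters are assumptions of this example, not part of the text.

```python
import math

def annealed_H(t, sigma=1.0, N=50, steps=4000, half_width=20.0):
    """(2/N) log E[exp(N t X11 / 2)] for X11 ~ N(0, 2 sigma^2 / N), by a Riemann sum."""
    std = math.sqrt(2.0 * sigma ** 2 / N)          # standard deviation of X11
    dx = 2.0 * half_width * std / steps            # grid spacing
    total = 0.0
    for k in range(steps + 1):
        x = -half_width * std + k * dx
        total += math.exp(0.5 * N * t * x - N * x * x / (4.0 * sigma ** 2))
    total *= dx / math.sqrt(4.0 * math.pi * sigma ** 2 / N)
    return 2.0 / N * math.log(total)

print(annealed_H(0.7), 0.7 ** 2 / 2)               # both should be close to 0.245
```

The grid spans 20 standard deviations of X11 because the integrand peaks at X11 = tσ², which for the default N = 50 lies several standard deviations away from zero.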
13.3.2 Quenched Average


The annealed average corresponds, in the replica language, to n = 1. Let us now turn to
arbitrary (integer) n. To keep notation light we will set σ² = 1. As in Eq. (10.31) we define
the partition function

$$Z_t(X) := \int \frac{d\psi}{(2\pi)^{N/2}}\, \delta\big(\|\psi\|^2 - N t\big) \exp\left(\frac{1}{2} \psi^T X \psi\right), \tag{13.53}$$

seeking to compute, at the end of the calculation,

$$\mathbb{E}[H_X(t)] = \lim_{N\to\infty} \frac{2}{N} \lim_{n\to 0} \frac{\mathbb{E}[Z_t^n(X)] - 1}{n} - 1 - \log t, \tag{13.54}$$

where 1 + log t is the large N limit of (2/N) log $Z_t(X = 0)$, with $Z_t(0)$ given by Eq. (10.38).
If we write $Z_t^n(X)$ as multiple copies of the same integral and express the Dirac deltas as
Fourier integrals over $z_\alpha$, we get

$$Z_t^n(X) = \int \prod_{\alpha=1}^n \frac{dz_\alpha}{4\pi i} \int \prod_{\alpha=1}^n \frac{d\psi_\alpha}{(2\pi)^{N/2}} \exp\left(\frac{1}{2} \sum_{\alpha=1}^n \left(N z_\alpha t - z_\alpha \psi_\alpha^T \psi_\alpha + \psi_\alpha^T X \psi_\alpha\right)\right). \tag{13.55}$$


In order to take the expectation value over the Gaussian random matrix X, we need as 
always to separate the diagonal and off-diagonal elements of X. The steps are the same as 
those we took in Section 13.1.2: 


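$$\mathbb{E}\left[\exp\left(\frac{1}{2} \sum_{i,j=1}^N X_{ij} \sum_{\alpha=1}^n \psi_{\alpha i} \psi_{\alpha j}\right)\right]
= \exp\left(\frac{1}{4N} \sum_{\alpha,\beta=1}^n \left(\sum_{i=1}^N \psi_{\alpha i} \psi_{\beta i}\right)^2\right), \tag{13.56}$$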


which can be rewritten as a Hubbard-Stratonovich integral over an n x n matrix q: 


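$$\mathbb{E}[\dots] = C(n) \int dq\, \exp\left(-N \frac{\operatorname{Tr} q^2}{4} + \sum_{\alpha,\beta=1}^n \sum_{i=1}^N \frac{q_{\alpha\beta}\, \psi_{\alpha i} \psi_{\beta i}}{2}\right), \tag{13.57}$$

where C(n) is a numerical coefficient. After Gaussian integration, one thus finds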


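$$\mathbb{E}[Z_t^n] \propto \int \prod_{\alpha=1}^n dz_\alpha \int dq\, \exp\left[\frac{N}{2} \left(t \operatorname{Tr} z - \frac{\operatorname{Tr} q^2}{2} - \operatorname{Tr} \log(z - q)\right)\right], \tag{13.58}$$

which makes sense provided that the real part of z is larger than all the eigenvalues of q.
We now define

$$F_n(q, z; t) := t \operatorname{Tr} z - \frac{\operatorname{Tr} q^2}{2} - \operatorname{Tr} \log(z - q), \tag{13.59}$$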
where z is the vector of $z_\alpha$ treated as a diagonal matrix. For n > 1 we need to find a
saddle point in the space of n × n matrices z and q, i.e. a point in that space where the first
derivatives of $F_n(q, z; t)$ are zero with a negative Hessian.

As a check, for n = 1 we have at the saddle point 


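$$q^* = \frac{1}{z^* - q^*} \quad\text{and}\quad t = \frac{1}{z^* - q^*} \quad\Rightarrow\quad z^* = t + \frac{1}{t} \quad\text{and}\quad q^* = t. \tag{13.60}$$

Hence

$$F_1(q^*, z^*; t) = \frac{t^2}{2} + 1 + \log t, \tag{13.61}$$

or

$$\hat{H}_{\mathrm{wig}}(t) = \frac{t^2}{2}, \tag{13.62}$$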


as it should be. 
We now go back to the general n case. Using Eq. (1.37) we can take a matrix derivative 
of Eq. (13.59) with respect to q and z: 


$$q = (z - q)^{-1} \quad\text{and}\quad t = \left[(z - q)^{-1}\right]_{\alpha\alpha}. \tag{13.63}$$


In the following technical part, we solve these equations for integer n > 1. We discuss the 
final result at the end of this subsection. 


The second equation in (13.63) comes from the derivative with respect to z; remember that
z is only a diagonal matrix, so the derivative with respect to z tells us only about the diagonal
elements.
From this we can argue that z must be a multiple of the identity and q of the form³

$$q = \begin{pmatrix} t & b & \cdots & b \\ b & t & \ddots & \vdots \\ \vdots & \ddots & \ddots & b \\ b & \cdots & b & t \end{pmatrix}, \tag{13.64}$$

for some b to be determined. To find an equation for b and z we need to express Eqs.
(13.63) in terms of those quantities. To do so we first write the matrix q as a rank-1
perturbation of a multiple of the identity matrix:

$$q = (t - b)\mathbf{1} + n b\, P_1, \tag{13.65}$$

where $P_1 = e e^T$ is the projector onto the normalized vector of all 1:

$$e = \frac{1}{\sqrt{n}} \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. \tag{13.66}$$


3 More complicated, block diagonal structures for q sometimes need to be considered in the limit n → 0. This is called “replica
symmetry breaking”, a phenomenon that occurs in many “complex” optimization problems, such as spin-glasses (see Section
13.4). Fortunately, in the present case, these complications are not present.
Note that the eigenvalues of the matrix q are (t − b) + nb (with multiplicity 1) and (t − b)
(with multiplicity (n − 1)). Since z is a multiple of the identity, the matrix z − q is a
rank-1 perturbation of a multiple of the identity and it can be inverted using the Sherman-Morrison
formula, Eq. (1.28). The first of Eqs. (13.63) becomes

$$(t - b)\mathbf{1} + n b\, P_1 = \frac{1}{z - t + b}\, \mathbf{1} + \frac{n b\, P_1}{(z - t + b)^2 \left(1 - n b\, (z - t + b)^{-1}\right)}. \tag{13.67}$$


We can now equate the prefactors in front of the identity matrix 1 and of the projector $P_1$
separately, to get two equations for our two unknowns (z and b). For the identity matrix
we get

$$(t - b) = \frac{1}{z - t + b} \quad\Rightarrow\quad z = (t - b) + \frac{1}{t - b}. \tag{13.68}$$

For the second equation, we first replace $(z - t + b)^{-1}$ by t − b and get

$$n b = \frac{(t - b)^2\, n b}{1 - n b (t - b)}. \tag{13.69}$$


We immediately find one solution: b = 0. For this solution both $q = q_0 \mathbf{1}$ and $z = z_0 \mathbf{1}$
are multiples of the identity and we have $q_0 = t$ and $z_0 = t + t^{-1}$. This coincides with
the (unique) solution we found in the annealed (n = 1) case. For general n, there are
potentially other solutions. Dividing out nb, we find a quadratic equation for b:

$$1 - n b (t - b) = (t - b)^2, \tag{13.70}$$

whose solutions we write as

$$b_\pm = \frac{(n - 2) t \pm \sqrt{n^2 t^2 - 4(n - 1)}}{2(n - 1)}. \tag{13.71}$$

From the two solutions for b we can compute the corresponding values of z using Eq.
(13.68). We get a term with a square-root on the denominator that we simplify using
$(c \pm \sqrt{d})^{-1} = (c \mp \sqrt{d})/(c^2 - d)$. After further simplification we find

$$z_\pm = \frac{n^2 t \pm (n - 2) \sqrt{n^2 t^2 - 4(n - 1)}}{2(n - 1)}. \tag{13.72}$$


We now need to choose one of the three solutions $z_0$, $z_+$ and $z_-$. First, we consider
integer n ≥ 1 where the replica method is perfectly legitimate. We will later deal with the
n → 0 limit.

For n = 1, $z_-$ is ill defined while $z_+$ becomes identical to $z_0$. The only solution for all
t is therefore $z_0$ and we recover the annealed result discussed in the previous subsection.

For n ≥ 2, we first notice that the solutions $z_\pm$ do not exist for $t < t_s := 2\sqrt{n - 1}/n$;
they yield a complex result when the result must be real. So for $t < t_s$, $z_0$ is the solution.
For larger values of t we should compare the values of $F_n(q^*, z^*; t)$ and choose the
maximum one. For $t > t_s$, the $z_+$ solution always dominates $z_-$, so we only consider
$z_+$ and $z_0$ henceforth.

For n = 2, the analysis is easy, $t_s = 1$ and $t_c = 1$: at t = 1, $z_+ = z_0$ and for t > 1
the $z_+$ solution dominates. For n > 2, the situation is a bit more subtle. The $z_+$ solution
appears at $t_s < 1$ but at that point $z_0$ still dominates. At t = 1, $z_+$ dominates. At some
$t = t_c$, with $t_s < t_c < 1$, we must have $F_n(q_0, z_0; t_c) = F_n(q_+, z_+; t_c)$. This point could
in principle be found analytically but it is easier to find numerically. In particular we do not have
an analytical expression for $t_c(n)$ except the above bound $t_c(n > 2) < 1$ and the value
$t_c(2) = 1$ (see Fig. 13.1).
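
The numerical search for t_c(n) can be sketched as follows, assuming the replica-symmetric form of q in Eq. (13.64), the solutions (13.71)-(13.72), and F_n from Eq. (13.59) with σ² = 1. The function names `saddle_points`, `F` and `critical_t`, and the use of bisection, are choices made for this illustration only.

```python
import math

def saddle_points(n, t):
    """Replica-symmetric saddle points (b, z): '0' always, '+' and '-' when real."""
    sols = {"0": (0.0, t + 1.0 / t)}
    disc = n * n * t * t - 4.0 * (n - 1)
    if disc >= 0.0:
        root = math.sqrt(disc)
        for name, sign in (("+", 1.0), ("-", -1.0)):
            b = ((n - 2) * t + sign * root) / (2.0 * (n - 1))            # Eq. (13.71)
            z = (n * n * t + sign * (n - 2) * root) / (2.0 * (n - 1))    # Eq. (13.72)
            sols[name] = (b, z)
    return sols

def F(n, t, b, z):
    """F_n(q, z; t) = t Tr z - Tr q^2 / 2 - Tr log(z - q) for q = (t - b) 1 + n b P_1."""
    lam_one = (t - b) + n * b        # eigenvalue of q with multiplicity 1
    lam_rest = t - b                 # eigenvalue of q with multiplicity n - 1
    tr_q2 = lam_one ** 2 + (n - 1) * lam_rest ** 2
    tr_log = math.log(z - lam_one) + (n - 1) * math.log(z - lam_rest)
    return n * t * z - 0.5 * tr_q2 - tr_log

def critical_t(n, iters=60):
    """Bisection for the t in [t_s, 1] where F_n at z_+ equals F_n at z_0 (integer n >= 2)."""
    lo, hi = 2.0 * math.sqrt(n - 1) / n, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        sols = saddle_points(n, mid)
        gap = F(n, mid, *sols["+"]) - F(n, mid, *sols["0"])
        if gap < 0.0:
            lo = mid                 # z_0 still dominates at mid
        else:
            hi = mid                 # z_+ dominates at mid
    return 0.5 * (lo + hi)

for n in (2, 3, 5, 10):
    print(n, 2.0 * math.sqrt(n - 1) / n, critical_t(n))
```

The bisection relies on the statement in the text that z_0 dominates at t_s and z_+ dominates at t = 1, so that the difference of the two values of F_n changes sign in between. Evaluating `saddle_points(0, t)["+"]` returns z = 2 for any t, in line with the n → 0 extrapolation of Eq. (13.72).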
Figure 13.1 The point $t_c(n)$ where $F_n(q_0, z_0; t_c) = F_n(q_+, z_+; t_c)$ and beyond which the $z_+$
solution dominates. Also shown is the point $t_s(n) = 2\sqrt{n - 1}/n$ where the $z_+$ solution starts to
exist. Note that $t_s < t_c < 1$ for all n > 2. Hence the transition appears below t = 1. Note also that
$t_c \to 0$ as n → ∞. (Horizontal axis: n.)


We can now put everything together but there is a trick to save computation effort.
Given that our solutions cancel the partial derivatives of $F_n(q, z; t)$ with respect to q and z,
we can easily compute its derivative with respect to t:

$$\frac{d}{dt} F_n(q^*(t), z^*(t); t) = \frac{\partial F_n}{\partial t}(q^*, z^*; t) = \operatorname{Tr} z^*(t) = n z^*(t). \tag{13.73}$$

Note that we can follow the value of $F_n(q^*, z^*; t)$ through the critical point $t_c$ because
$F_n(q^*, z^*; t)$ is continuous at that point (even if its derivative is not).


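The above analysis therefore allows us to find, for n ≥ 1,

$$\log \mathbb{E}[Z_t^n] \approx \frac{n N}{2} F_n(t), \tag{13.74}$$

where

$$\frac{d}{dt} F_n(t) = \begin{cases} t + \dfrac{1}{t} & \text{for } t < t_c(n), \\[4pt] z_+(t) & \text{for } t > t_c(n), \end{cases} \tag{13.75}$$

with the boundary condition $F_n(0) = 0$.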

We can now analytically continue this solution down to n → 0. The first regime, for
small t, is easy as it does not depend on n. In the large t regime, the extrapolation of $z_+(t)$
to n → 0 gives the very simple result (see Eq. (13.72)): $z_+ = 2$ for all t. The most tricky
part is to find the critical point where one goes from the $z_0 = t + 1/t$ solution to the $z_+ = 2$
solution. We cannot analytically continue $t_c(n)$ to n → 0, as we have no explicit formula
for it. On the other hand, we can directly find the point $t_c(n = 0)$ at which the two solutions
lead to the same $F_0$. It is relatively straightforward to show that this point is $t_c = 1$.⁴
Correspondingly,

$$\frac{d}{dt} F_0(t) = \begin{cases} t + \dfrac{1}{t} & \text{for } t < t_c(0) = 1, \\[4pt] 2 & \text{for } t > t_c(0) = 1. \end{cases} \tag{13.76}$$

4 Something peculiar happens as n → 0, namely the minimum solution becomes the maximum one and vice versa. In other
words for small t, where we know that $z_0$ is the right solution, we have $F_0(z_0; t) < F_0(z_+; t)$ and the opposite at large t. This
paradox is always present within the replica method when n → 0.

We can now go back to the definition of the function $\mathbb{E}[H_X(t)]$ (Eq. (13.54)). After
taking the n → 0 and N → ∞ limits we find (see Fig. 13.2)

$$\frac{d}{dt} \mathbb{E}[H_X(t)] = \begin{cases} t & \text{for } t < 1, \\[4pt] 2 - \dfrac{1}{t} & \text{for } t > 1, \end{cases} \tag{13.77}$$

with the condition $\mathbb{E}[H_X(t = 0)] = 0$.

Figure 13.2 The function $H_X(t)$ for a unit Wigner computed with a “quenched” average (HCIZ
integral) and an “annealed” average. We also show the upper bound given by Eq. (10.59). The
annealed and quenched averages are identical up to $t = t_c = 1$ and differ for larger t. The annealed
average violates the bound, which is expected as in this case $\lambda_{\max}$ fluctuates and exceptionally large
values of $\lambda_{\max}$ dominate the average exponential HCIZ integral.

The upshot of this (rather complex) calculation is that, as announced in the introduction,
for $t < t_c = 1$ the quenched and annealed results coincide, i.e. $\hat{H}_{\mathrm{wig}}(t) = \mathbb{E}[H_X(t)]$.
For $t > t_c$, on the other hand, the two quantities are different. The reason is that for
sufficiently large values of t, the average HCIZ integral becomes dominated by very rare
Wigner matrices that happen to have a largest eigenvalue significantly larger than the
Wigner semi-circle edge $\lambda_+ = 2$. This allows $\hat{H}_{\mathrm{wig}}(t)$ to continue its quadratic progression,
while $\mathbb{E}[H_X(t)]$ is dominated by the edge of the average spectrum and its growth with t
is contained (see Fig. 13.2). When one computes higher and higher moments of the HCIZ
integral (i.e. as the number of replicas n increases), the dominance of extreme eigenvalues
becomes more and more acute, leading to a smaller and smaller transition point $t_c(n)$.⁵
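
For reference, Eq. (13.77) can be integrated explicitly and compared with the annealed result of Eq. (13.62). The worked step below assumes only those two equations and the condition that the quenched function vanishes at t = 0 and is continuous at t = 1.

```latex
\begin{aligned}
\mathbb{E}[H_X(t)] &= \int_0^t s \, ds = \frac{t^2}{2}, && t \le 1, \\
\mathbb{E}[H_X(t)] &= \frac{1}{2} + \int_1^t \left(2 - \frac{1}{s}\right) ds = 2t - \log t - \frac{3}{2}, && t > 1, \\
\hat{H}_{\mathrm{wig}}(t) - \mathbb{E}[H_X(t)] &= \frac{t^2}{2} - 2t + \log t + \frac{3}{2},
\qquad \frac{d}{dt}\Big(\hat{H}_{\mathrm{wig}} - \mathbb{E}[H_X]\Big) = \frac{(t - 1)^2}{t} \ge 0, && t > 1.
\end{aligned}
```

The difference vanishes at t = 1 and has a non-negative derivative, so the annealed value lies above the quenched one for all t > 1, while the quenched function grows only linearly at large t.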


13.4 Spin-Glasses, Replicas and Low-Rank HCIZ


“Spin-glasses” are disordered magnetic materials exhibiting a freezing transition as the 
temperature is reduced. Typical examples are silver-manganese (or copper—manganese) 
alloys, where the manganese atoms carry a magnetic spin and are randomly dispersed 
in a non-magnetic matrix. Contrary to a ferromagnet (i.e. usual magnets like permalloy), 
where all microscopic spins agree to point more or less in the same direction when the 
temperature is below a certain transition temperature (872 K for permalloy), spins in spin- 
glasses freeze, but the configuration they adopt is disordered, “amorphous”, with zero net 
magnetization. 

A simple model to explain the phenomenon is the following. The energy of N spins
$S_i = \pm 1$ is given by

$$\mathcal{H}(\{S\}) = \frac{1}{2} \sum_{i,j=1}^N J_{ij} S_i S_j, \tag{13.78}$$


where J is a random matrix, which we take to be drawn from a rotationally invariant
ensemble, i.e. $J = O \Lambda O^T$ where O is chosen according to the (flat) Haar measure over
O(N) and Λ is a certain fixed diagonal matrix with τ(Λ) = 0, such that any pair of spins is
as likely to want to point in the same direction or in opposite directions. The simplest case
corresponds to J = X, a Wigner matrix, in which case the spectrum of Λ is the Wigner
semi-circle. This case corresponds to the celebrated Sherrington-Kirkpatrick (SK) model,
but other cases have been considered in the literature as well.

The physical properties of the system are encoded in the average free energy F,
defined as

$$F := -T\, \mathbb{E}_J[\log Z]; \qquad Z := \sum_{\{S\}} \exp\left(\frac{\mathcal{H}(\{S\})}{T}\right), \tag{13.79}$$

where the partition function Z is obtained as the sum over all $2^N$ configurations of the N
spins, and T is the temperature. One of the difficulties of the theory of spin-glasses is to
perform the average over the interaction matrix J of the logarithm of Z. Once again, one
can try to use the replica trick to perform this average, to wit,


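$$\mathbb{E}_J[\log Z] = \left. \frac{\partial}{\partial n} \mathbb{E}_J[Z^n] \right|_{n=0}. \tag{13.80}$$

One then computes the right hand side for integer n and hopes that the analytic continuation
to n → 0 makes sense. Introducing n replicas of the system, one has

$$\mathbb{E}_J[Z^n] = \mathbb{E}_J\left[\sum_{\{S^1, S^2, \dots, S^n\}} \exp\left(\frac{\sum_{\alpha=1}^n \sum_{i,j=1}^N J_{ij} S_i^\alpha S_j^\alpha}{2T}\right)\right]. \tag{13.81}$$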


5 A very similar mechanism is at play in Derrida’s random energy model, see Derrida [1981].

Now, for a fixed configuration of all nN spins $\{S^1, S^2, \dots, S^n\}$, the N × N matrix
$K^{(n)}_{ij} = \sum_{\alpha=1}^n S_i^\alpha S_j^\alpha / N$ is at most of rank n, which is small compared to N which we will
take to infinity. So we have to compute a low-rank HCIZ integral, which is given by
Eq. (10.45):

$$\mathbb{E}_J\left[\exp\left(\frac{N}{2T} \operatorname{Tr}\, O \Lambda O^T K^{(n)}\right)\right] \approx \exp\left(\frac{N}{2} \operatorname{Tr}\, H_J\big(K^{(n)}/T\big)\right), \tag{13.82}$$


where $H_J$ is the anti-derivative of the R-transform of J (or Λ). Now the non-zero eigenvalues
of the N × N matrix $K^{(n)}$ are the same as those of the n × n matrix
$Q_{\alpha\beta} = \sum_{i=1}^N S_i^\alpha S_i^\beta / N$, called the overlap matrix, because its elements measure the similarity of
the configurations $\{S^\alpha\}$ and $\{S^\beta\}$. Hence, we have to compute

$$\mathbb{E}_J[Z^n] = \sum_{\{S^1, S^2, \dots, S^n\}} \exp\left(\frac{N}{2} \operatorname{Tr}_n H_J(Q/T)\right). \tag{13.83}$$
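
The statement that the non-zero eigenvalues of K and of the overlap matrix Q coincide can be checked without an eigenvalue solver by comparing traces of powers, since K has rank at most n. The sketch below does this for random spin configurations; the names `overlap_and_K` and `trace_power`, the sizes n = 3 and N = 12, and the seed are assumptions of this example.

```python
import random

def overlap_and_K(n=3, N=12, seed=0):
    """Draw n random spin configurations of N spins; return Q (n x n) and K (N x N)."""
    rng = random.Random(seed)
    S = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(n)]   # S[alpha][i]
    Q = [[sum(S[a][i] * S[b][i] for i in range(N)) / N for b in range(n)]
         for a in range(n)]
    K = [[sum(S[a][i] * S[a][j] for a in range(n)) / N for j in range(N)]
         for i in range(N)]
    return Q, K

def trace_power(M, k):
    """Tr M^k computed by repeated matrix multiplication."""
    size = len(M)
    P = [row[:] for row in M]
    for _ in range(k - 1):
        P = [[sum(P[i][l] * M[l][j] for l in range(size)) for j in range(size)]
             for i in range(size)]
    return sum(P[i][i] for i in range(size))

Q, K = overlap_and_K()
for k in range(1, 5):
    print(k, trace_power(Q, k), trace_power(K, k))   # the two columns should agree
```

Equal power traces for k = 1, ..., n fix the n possibly non-zero eigenvalues, so agreement of the two columns is equivalent to the spectral statement in the text.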


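It should be clear to the reader that all the steps above are very close to the ones followed in
Section 13.2.2. We continue in the same vein, introducing a new n × n matrix for imposing
the constraint $Q_{\alpha\beta} = \sum_{i=1}^N S_i^\alpha S_i^\beta / N$:

$$1 \propto \int_{-i\infty}^{i\infty} N^{n(n+1)/2}\, dY \int dQ\, \exp\left(-N \operatorname{Tr} Q Y + \sum_{\alpha,\beta=1}^n Y_{\alpha\beta} \sum_{i=1}^N S_i^\alpha S_i^\beta\right). \tag{13.84}$$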


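The nice thing about this representation is that sums over i become totally decoupled. So
we get

$$\mathbb{E}_J[Z^n] = C \int_{-i\infty}^{i\infty} dY \int dQ\, \exp\left(\frac{N}{2} \operatorname{Tr}_n H_J(Q/T) - N \operatorname{Tr}_n Q Y + N \mathcal{S}(Y)\right), \tag{13.85}$$

where C is an irrelevant constant and

$$\mathcal{S}(Y) := \log \mathcal{Z} \quad\text{with}\quad \mathcal{Z} := \sum_{S} \exp\left(\sum_{\alpha,\beta=1}^n Y_{\alpha\beta} S^\alpha S^\beta\right). \tag{13.86}$$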

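In the large N limit, Eq. (13.85) can be estimated using a saddle point method over Y and
Q. As in Section 13.2.2 the first equation reads

$$Y_{\alpha\beta} = \frac{1}{2T} \left[R_J(Q/T)\right]_{\alpha\beta}, \tag{13.87}$$

and taking derivatives with respect to $Y_{\alpha\beta}$, we find

$$Q_{\alpha\beta} = \frac{1}{\mathcal{Z}} \sum_{S} S^\alpha S^\beta \exp\left(\sum_{\alpha',\beta'=1}^n Y_{\alpha'\beta'} S^{\alpha'} S^{\beta'}\right), \tag{13.88}$$

which leads to the following self-consistent equation for Q:

$$Q_{\alpha\beta} = \frac{\sum_{S} S^\alpha S^\beta \exp\left(\frac{1}{2T} \sum_{\alpha',\beta'=1}^n \left[R_J(Q/T)\right]_{\alpha'\beta'} S^{\alpha'} S^{\beta'}\right)}{\sum_{S} \exp\left(\frac{1}{2T} \sum_{\alpha',\beta'=1}^n \left[R_J(Q/T)\right]_{\alpha'\beta'} S^{\alpha'} S^{\beta'}\right)} := \langle S^\alpha S^\beta \rangle. \tag{13.89}$$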


At sufficiently high temperatures, one can expect the solution of these equations to be 
“replica symmetric”, i.e. 


$$Q_{\alpha\beta} = \delta_{\alpha\beta}(1 - q) + q. \tag{13.90}$$
This matrix has two eigenvalues, one non-degenerate equal to 1 + (n − 1)q and another
(n − 1)-fold degenerate equal to 1 − q. Correspondingly, $R_J(Q/T)$ has eigenvalues
$R_J((1 + (n - 1)q)/T)$ and $R_J((1 - q)/T)$, from which we reconstruct the diagonal
and off-diagonal elements of $R_J(Q/T)$:

$$\begin{aligned}
r &:= \left[R_J(Q/T)\right]_{\alpha \neq \beta} = \frac{1}{n} \left[R_J\big((1 + (n - 1)q)/T\big) - R_J\big((1 - q)/T\big)\right], \\
r_d &:= \left[R_J(Q/T)\right]_{\alpha\alpha} = R_J\big((1 - q)/T\big) + r.
\end{aligned} \tag{13.91}$$


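Injecting in the definition of $\mathcal{Z}$, we find

$$\mathcal{Z} = \sum_{S} \exp\left[\frac{1}{2T} \left(n r_d + r \sum_{\alpha \neq \beta = 1}^n S^\alpha S^\beta\right)\right], \tag{13.92}$$

where we have used $(S^\alpha)^2 = 1$. Writing $\sum_{\alpha \neq \beta = 1}^n S^\alpha S^\beta = \left(\sum_\alpha S^\alpha\right)^2 - n$ and using a
Hubbard-Stratonovich transformation, one gets

$$\mathcal{Z} = e^{n (r_d - r)/2T} \sum_{S} \int_{-\infty}^{+\infty} \frac{dx}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2} + x \sqrt{\frac{r}{T}} \sum_{\alpha=1}^n S^\alpha\right). \tag{13.93}$$

The sums over different $S^\alpha$ now again decouple, leading to

$$\mathcal{Z} = 2^n e^{n (r_d - r)/2T} \int_{-\infty}^{+\infty} \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2} \cosh^n\left(x \sqrt{\frac{r}{T}}\right). \tag{13.94}$$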


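Now, one can notice that

$$\sum_{\alpha \neq \beta = 1}^n \langle S^\alpha S^\beta \rangle = n(n - 1) q = 2T \frac{\partial \log \mathcal{Z}}{\partial r}, \tag{13.95}$$

to get, in the limit n → 0 and after a few manipulations (including an integration by parts), an
equation involving only q:

$$q = \int_{-\infty}^{+\infty} \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2} \tanh^2\left(x \sqrt{\frac{r}{T}}\right), \tag{13.96}$$

where, in the limit n → 0,

$$r = \frac{q}{T}\, R_J'\left(\frac{1 - q}{T}\right). \tag{13.97}$$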

Clearly, q = 0 is always a solution of this equation. The physical interpretation of q is
the following: choose randomly two microscopic configurations of spins $\{S^\alpha\}$ and $\{S^\beta\}$,
each with a weight given by $\exp\left(\mathcal{H}(\{S\})/T\right)/Z$. Then, the average overlap between these
configurations, $\sum_{i=1}^N S_i^\alpha S_i^\beta / N$, is equal to q. When q = 0, these two configurations are
thus uncorrelated. One expects this to be the case at high enough temperature, where the
system explores randomly all configurations.

When the spins start freezing, on the other hand, one expects the system to strongly
favor some (amorphous) configurations over others. Hence, one expects that q > 0
in the spin-glass phase. Expanding the right hand side of Eq. (13.96) for small q
gives

$$\int_{-\infty}^{+\infty} \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2} \tanh^2\left(x \sqrt{\frac{r}{T}}\right)
= \frac{R_J'(1/T)}{T^2}\, q - \left(\frac{R_J''(1/T)}{T^3} + 2\, \frac{R_J'(1/T)^2}{T^4}\right) q^2 + O(q^3). \tag{13.98}$$

Assuming that the coefficient in front of the q² term is negative, we see that a non-zero q
solution appears continuously below a critical temperature $T_c$ given by

$$T_c^2 = R_J'\left(\frac{1}{T_c}\right). \tag{13.99}$$

When J is a Wigner matrix, the spin-glass model is the one studied originally by Sherrington
and Kirkpatrick in 1975. In this case, $R_J(x) = x$ and therefore $T_c = 1$. There are cases,
however, where the transition is discontinuous, i.e. where the overlap q jumps from zero
for $T > T_c$ to a non-zero value at $T_c$. In these cases, the small q expansion is unwarranted
and another method must be used to find the critical temperature. One example is the
“random orthogonal model”, where the coupling matrix J has N/2 eigenvalues equal to
+1 and N/2 eigenvalues equal to −1.
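
The replica-symmetric equations (13.96) and (13.97) can be solved by simple fixed-point iteration. The sketch below is only an illustration under stated assumptions: the derivative of the R-transform is passed in as a function `r_prime` (defaulting to the SK case, where R_J(x) = x), the Gaussian average is a midpoint Riemann sum, and the names `gaussian_average` and `rs_overlap` are hypothetical. Below T_c the replica-symmetric ansatz is not the correct solution, so the value returned there is indicative only.

```python
import math

def gaussian_average(f, half_width=8.0, steps=800):
    """Approximate E[f(x)] for x ~ N(0, 1) with a midpoint Riemann sum."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * dx
        total += f(x) * math.exp(-0.5 * x * x)
    return total * dx / math.sqrt(2.0 * math.pi)

def rs_overlap(T, r_prime=lambda x: 1.0, q0=0.5, iterations=200):
    """Iterate q -> E[tanh^2(x sqrt(r/T))] with r = (q/T) R_J'((1 - q)/T)."""
    q = q0
    for _ in range(iterations):
        r = (q / T) * r_prime((1.0 - q) / T)        # Eq. (13.97)
        a = math.sqrt(max(r, 0.0) / T)
        q = gaussian_average(lambda x: math.tanh(a * x) ** 2)   # Eq. (13.96)
    return q

print(rs_overlap(1.5))   # above T_c = 1 the iteration collapses to q = 0
print(rs_overlap(0.5))   # below T_c a non-zero replica-symmetric q survives
```

Starting from a non-zero `q0` matters: q = 0 is always a fixed point, and the iteration only leaves it when the linear coefficient R_J'(1/T)/T² of Eq. (13.98) exceeds one.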

The spin-glass phase $T < T_c$ is much more complicated to analyze, because the replica
symmetric ansatz $Q_{\alpha\beta} = \delta_{\alpha\beta}(1 - q) + q$ is no longer valid. One speaks about “replica
symmetry breaking”, which encodes an exquisitely subtle, hierarchical organization of the
phase space of these models. This is the physical content of the celebrated Parisi solution
of the SK model, but is much beyond the scope of the present book, and we encourage the
curious reader to learn more from the references given below.
14
Edge Eigenvalues and Outliers


In many instances, the eigenvalue spectrum of large random matrices is confined to a
single interval of finite size. This is of course the case for Wigner matrices, where the
correctly normalized eigenvalues fall between $\lambda_- = -2$ and $\lambda_+ = +2$, with a semi-circular
distribution between the two edges, and, correspondingly, a square-root singularity
of the density of eigenvalues close to the edges. This is also the case for Wishart matrices,
for which again the density of eigenvalues has square-root singularities close to both edges.
As discussed in Section 5.3.2, this is a generic property, with a few notable exceptions. One
example is provided by Wishart matrices with parameter q = 1, for which the eigenvalue
spectrum extends down to λ = 0 with an inverse square-root singularity there. Another
case is that of Wigner matrices constrained to have all eigenvalues positive: the spectrum
also has an inverse square-root singularity (see Eq. (5.94)). One speaks of a “hard edge” in
that case, because the minimum eigenvalue is imposed by a strict constraint. The Wigner
semi-circle edge at $\lambda_+ = 2$, on the other hand, is “soft” and appears naturally as a result
of the minimization of the energy of a collection of interacting Coulomb charges in an
external potential.¹

1 Note that there are also cases where the soft edge has a different singularity, see Section 5.3.3, and cases where the eigenvalue
spectrum extends up to infinity, for example “Lévy matrices” with IID elements of infinite variance.

Consider for example Wigner matrices of size N. The existence of sharp edges delimiting
a region where one expects to see a non-zero density of eigenvalues from a region where
there should be none is only true in the asymptotically large size limit N → ∞. For large
but finite N, on the other hand, one expects that the probability to find an eigenvalue beyond
the Wigner sea is very small but non-zero. The width of the transition region, and the tail
of the density of states was investigated a while ago, culminating in the beautiful results by
Tracy and Widom on the distribution of the largest eigenvalue of a random matrix, which
we will describe in the next section. The most important result is that the width of the region
around $\lambda_+$ within which one expects to observe the largest eigenvalue of a Wigner matrix
goes down as $N^{-2/3}$.

Hence the largest eigenvalue $\lambda_{\max}$ does not fluctuate very far away from the classical
edge $\lambda_+$. Take for example N = 1000; $\lambda_{\max}$ is within $1000^{-2/3} = 0.01$ away from $\lambda_+ = 2$.
In real applications the largest eigenvalue can deviate quite substantially from the classical
edge. The origin of such a large eigenvalue is usually not an improbably large Tracy-Widom
fluctuation but rather a true outlier that should be modeled separately. This is the
goal of the present chapter. We will see in particular that perturbing a Wigner (or Wishart)
matrix with a deterministic, low-rank matrix of sufficient amplitude $a > a_c$ generates
“true” outliers, which remain at a distance O(a) from the upper edge. For $a < a_c$ on the
other hand, the largest eigenvalue remains at distance $N^{-2/3}$ from $\lambda_+$.


14.1 The Tracy-Widom Regime 


The Tracy-Widom result characterizes precisely the distance between the largest eigenvalue
$\lambda_{\max}$ of Gaussian Wigner or Wishart matrices and the upper edge of the spectrum
which we denoted by $\lambda_+$. This result can be (formally) stated as follows: the rescaled
distribution of $\lambda_{\max} - \lambda_+$ converges, for N → ∞, towards the Tracy-Widom distribution,
usually noted $F_1$:

$$\mathbb{P}\left(\lambda_{\max} \le \lambda_+ + \gamma N^{-2/3} u\right) = F_1(u), \tag{14.1}$$

where γ is a constant that depends on the problem and $F_1(u)$ is the β = 1 Tracy-Widom
distribution. For the Wigner problem, $\lambda_+ = 2$ and γ = 1, whereas for Wishart matrices,
$\lambda_+ = (1 + \sqrt{q})^2$ and $\gamma = \sqrt{q}\, \lambda_+^{2/3}$. In fact, Eq. (14.1) holds for a much wider class
of N × N random matrices, for example symmetric random matrices with arbitrary IID
elements with a finite fourth moment. The Tracy-Widom distribution for all three values
of β is plotted in Figure 14.1. (The case where the fourth moment is infinite is discussed in
Section 14.3 below.)


Figure 14.1 Rescaled and shifted probability density of the largest eigenvalue for a large class
of random matrices such as Wigner and Wishart: the Tracy-Widom distribution. The distribution
depends on the Dyson index (β) and is shown here for β = 1, 2 and 4.
Everything is known about the Tracy-Widom density $f_1(u) = F_1'(u)$, in particular its
left and right far tails:

$$\ln f_1(u) \propto -u^{3/2} \quad (u \to +\infty); \qquad \ln f_1(u) \propto -|u|^3 \quad (u \to -\infty). \tag{14.2}$$

One notices that the left tail is much thinner than the right tail: pushing the largest eigenvalue
inside the Wigner sea implies compressing the whole Coulomb gas of repulsive
charges, which is more difficult than pulling one eigenvalue away from $\lambda_+$. Using this
analogy and the formalism of Section 5.4.2, the large deviation regime of the Tracy-Widom
problem (i.e. for $\lambda_{\max} - \lambda_+ = O(1)$) can be obtained. Note that the result is exponentially
small in N as the $u^{3/2}$ behavior for u → ∞ combines with $N^{2/3}$ to give a linear in N
dependence.
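
The last remark can be spelled out in one line. The step below assumes only the scaling of Eq. (14.1) and the right tail of Eq. (14.2); the symbols x (an O(1) distance from the edge) and c (an unspecified positive constant) are introduced here for illustration.

```latex
\begin{aligned}
u &= \frac{N^{2/3}}{\gamma}\,(\lambda_{\max} - \lambda_+) = \frac{N^{2/3}\, x}{\gamma},
\qquad x := \lambda_{\max} - \lambda_+ = O(1), \\
\ln f_1(u) &\propto -u^{3/2} = -N \left(\frac{x}{\gamma}\right)^{3/2}
\quad\Longrightarrow\quad
f_1 \sim e^{-c\, N\, x^{3/2}}, \qquad c > 0.
\end{aligned}
```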
The distribution of the smallest eigenvalue $\lambda_{\min}$ around the lower edge $\lambda_-$ is also Tracy-Widom,
except in the particular case of Wishart matrices with q = 1. In this case $\lambda_- = 0$,
which is a “hard” edge since all eigenvalues of the empirical matrix must be non-negative.²
The behavior of the width of the transition region can be understood using a simple
heuristic argument. Suppose that the N = ∞ density goes to zero near the upper edge $\lambda_+$
as $(\lambda_+ - \lambda)^\theta$ (generically, θ = 1/2 as is the case for the Wigner and the Marčenko-Pastur
distributions). For finite N, one expects not to be able to say whether the density is zero or
non-zero when the probability to observe an eigenvalue is of order 1/N, i.e. when the O(1)
eigenvalue is within the “blurred” region. This leads to a blurred region of width

$$\int_{\lambda_+ - \Delta_N}^{\lambda_+} (\lambda_+ - \lambda)^\theta\, d\lambda \propto \Delta_N^{1+\theta} \sim \frac{1}{N} \quad\Rightarrow\quad \Delta_N \sim N^{-\frac{1}{1+\theta}}, \tag{14.3}$$


which goes to zero as $N^{-2/3}$ in the generic square-root case θ = 1/2. More precisely, for
Gaussian ensembles, the average density of states at a distance ∼ $N^{-2/3}$ from the edge
behaves as

$$\rho_N(\lambda \approx \lambda_+) = N^{-1/3}\, \Phi_1\left[N^{2/3} (\lambda - \lambda_+)\right], \tag{14.4}$$

with $\Phi_1(u \to -\infty) \approx \sqrt{-u}/\pi$ so as to recover the asymptotic square-root singularity,
since the N dependence disappears in that limit. Far from the edge, $\ln \Phi_1(u \to +\infty) \propto -u^{3/2}$,
showing that the probability to find an eigenvalue outside of the allowed band
decays exponentially with N and super-exponentially with the distance to the edge. The
function $\Phi_1(u)$ is not known analytically for real Wigner matrices (β = 1) but an explicit
expression is available for complex Hermitian Wigner matrices, and reads (Fig. 14.2)

$$\Phi_2(u) = \mathrm{Ai}'^2(u) - u\, \mathrm{Ai}^2(u), \tag{14.5}$$

with the same asymptotic behaviors as $\Phi_1(u)$. (Ai(u) is the standard Airy function.)


2 This special case is treated in Péché [2003]. 
Figure 14.2 Behavior of the density near the edge $\lambda_+$ at the scale $N^{-2/3}$ for complex Hermitian
Wigner matrices given by Eq. (14.5). For comparison the probability of the largest eigenvalue
$f_2(u)$ is also shown. For positive values, the two functions are almost identical and behave as
$\exp(-4u^{3/2}/3)/(8\pi u)$ for large u (right). For negative arguments the functions are completely
different: $\Phi_2(u)$ behaves as $\sqrt{-u}/\pi$ for large negative u while $f_2(u) \to 0$ as the largest eigenvalue
cannot be in the bulk (left).


14.2 Additive Low-Rank Perturbations 
14.2.1 Eigenvalues 


We will now study the outliers for an additive perturbation to a large random matrix. Take a
large symmetric random matrix M (e.g. Wigner or Wishart) with a well-behaved asymptotic
spectrum that has a deterministic right edge $\lambda_+$. We would like to know what happens when
one adds to M a low-rank (deterministic) perturbation. For simplicity, we only consider the
rank-1 perturbation $a\, u u^T$ with $\|u\| = 1$ and a of order 1, but the results below easily
generalize to the case of a rank-n perturbation with n ≪ N.

We want to know whether there will be an isolated eigenvalue of $M + a\, u u^T$ outside the
spectrum of M (i.e. an “outlier”) or not. To answer this question, we calculate the matrix
resolvent

$$G_a(z) = \left(z - M - a\, u u^T\right)^{-1}. \tag{14.6}$$

The matrix $G_a(z)$ has a pole at every eigenvalue of $M + a\, u u^T$. An alternative approach
would have been to study the zeros of the function $\det(z - M - a\, u u^T)$, but the full resolvent
$G_a(z)$ also gives us information about the eigenvectors.

Now we apply the Sherman-Morrison formula (1.28); taking A = z − M, we get

$$G_a(z) = G(z) + a\, \frac{G(z)\, u u^T G(z)}{1 - a\, u^T G(z)\, u}, \tag{14.7}$$
where G(z) is the resolvent of the original matrix M. We are looking for a real eigenvalue
such that $\lambda_1 > \lambda_+$. Let us take $z = \lambda_1 \in \mathbb{R}$ outside the spectrum of M, so $G(\lambda_1)$ is real and
regular. To have an outlier at $\lambda_1$, we need a pole of $G_a$ at $\lambda_1$, i.e. the following equation
needs to be satisfied:

$$1 - a\, u^T G(\lambda_1)\, u = 0. \tag{14.8}$$


Assume now that M is drawn from a rotationally invariant ensemble or, equivalently, that
the vector u is an independent random vector uniformly distributed on the unit sphere. In
the language of Chapter 11, we say that the perturbation $a\, u u^T$ is free from the matrix M.
We then have, in the eigenbasis of G,

$$u^T G(z)\, u = \sum_i u_i^2\, G_{ii}(z) \approx N^{-1} \operatorname{Tr} G(z) = g_N(z) \xrightarrow{N \to \infty} g(z). \tag{14.9}$$


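Thus we have a pole when

$$a\, g(\lambda_1) = 1 \quad\Rightarrow\quad g(\lambda_1) = 1/a. \tag{14.10}$$

If $\mathfrak{z}(g)$, the inverse function of g(z), exists, we arrive at

$$\lambda_1 = \mathfrak{z}\left(\frac{1}{a}\right). \tag{14.11}$$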


The condition for the invertibility of g(z) happens to be precisely the same as the condition
to have an outlier, i.e. $\lambda_1 > \lambda_+$ (see Section 10.4). We have established there that
$\lambda_1 = \mathfrak{z}(1/a)$ is monotonically increasing in a, and $\lambda_1 = \lambda_+$ when $a = a^* = 1/g(\lambda_+)$, which
is the critical value of a for which an outlier first appears. Generically, $g_+ = g(\lambda_+)$ is a
minimum of $\mathfrak{z}(g)$:

$$\left. \frac{d\mathfrak{z}(g)}{dg} \right|_{g_+} = 0 \quad\text{when}\quad \mathfrak{z}(g_+) = \lambda_+. \tag{14.12}$$


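For instance, for Wigner matrices, we have $\mathfrak{z}(g) = \sigma^2 g + g^{-1}$, for which

$$\sigma^2 - g_+^{-2} = 0 \quad\Rightarrow\quad g_+ = \sigma^{-1}, \tag{14.13}$$

and $\lambda_+ = \mathfrak{z}(\sigma^{-1}) = 2\sigma$, which is indeed the right edge of the semi-circle law.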

In sum, for $a > a^* = 1/g_+$, there exists a unique outlier eigenvalue that is increasing
with a. The smallest value for which we can have an outlier is $a^* = 1/g_+$, corresponding
to $\lambda_1 = \lambda_+$. For $a < a^*$ there is no outlier to the right of $\lambda_+$.³

Using the relation between the inverse function $\mathfrak{z}(g)$ and the R-transform (10.10), we
can express the position of the outlier as

$$\lambda_1 = R\left(\frac{1}{a}\right) + a \quad\text{for}\quad a > a^* = \frac{1}{g_+}. \tag{14.14}$$
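
The recipe of Eqs. (14.12) and (14.14) can be turned into a few lines of code: minimize R(g) + 1/g to locate g_+, then read off a*, the edge and the outlier position. The sketch below is a minimal illustration; the function name `edge_and_outlier`, the ternary search, and the convexity assumption stated in the docstring are choices made here. The white Wishart R-transform R(g) = 1/(1 − qg) used in the second example is the standard form and is not derived in this section.

```python
import math

def edge_and_outlier(a, R, g_max):
    """Return (a_star, lambda_plus, lambda_1) for M + a u u^T given the R-transform R of M.

    Assumes z(g) = R(g) + 1/g is convex with a single minimum on (0, g_max).
    When a <= a_star there is no outlier and lambda_1 is reported as lambda_plus.
    """
    def z(g):
        return R(g) + 1.0 / g

    lo, hi = 0.0, g_max
    for _ in range(200):                      # ternary search for the minimum of z(g)
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if z(m1) < z(m2):
            hi = m2
        else:
            lo = m1
    g_plus = 0.5 * (lo + hi)
    a_star = 1.0 / g_plus                     # critical amplitude
    lambda_plus = z(g_plus)                   # upper edge of the bulk, Eq. (14.12)
    lambda_1 = z(1.0 / a) if a > a_star else lambda_plus   # Eq. (14.14)
    return a_star, lambda_plus, lambda_1

# Unit Wigner: R(g) = g, so a* = 1, lambda_+ = 2 and lambda_1 = a + 1/a.
print(edge_and_outlier(2.0, lambda g: g, 10.0))
# White Wishart with q = 0.25: R(g) = 1 / (1 - q g), defined for g < 1/q.
print(edge_and_outlier(1.5, lambda g: 1.0 / (1.0 - 0.25 * g), 4.0))
```

For the Wishart example the search gives a* = √q (1 + √q) and an edge at (1 + √q)², which matches the value of λ_+ quoted for Wishart matrices in Section 14.1.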


3 Outliers such that $\lambda_1 < \lambda_-$ behave similarly, we just need to consider the matrix $-M - a\, u u^T$ and follow the same logic.
Figure 14.3 Largest eigenvalue of a Gaussian Wigner matrix with σ² = 1 with a rank-1 perturbation
of magnitude a. Each dot is the largest eigenvalue of a single random matrix with N = 200. Equation
(14.16) is plotted as the solid curve. For a < 1, the fluctuations follow a Tracy-Widom law with
$N^{-2/3}$ scaling, while for a > 1 the fluctuations are Gaussian with $N^{-1/2}$ scaling. From the graph,
we see fluctuations that are indeed smaller when a < 1. They also have a negative mean and positive
skewness, in agreement with the Tracy-Widom distribution. (Legend: dots, simulation; solid line,
$\lambda_1 = \tilde{a} + 1/\tilde{a}$ with $\tilde{a} = \max(a, 1)$.)


Using the cumulant expansion of the R-transform (11.63), we then get a general expression
for large a:

$$\lambda_1 = a + \tau(M) + \frac{\kappa_2(M)}{a} + O(a^{-2}). \tag{14.15}$$


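For Wigner matrices, we actually have for all a (see Fig. 14.3)

$$\lambda_1 = a + \frac{\sigma^2}{a} \quad\text{for}\quad a > a^* = \sigma. \tag{14.16}$$

When $a \to a^*$, on the other hand, one has

$$\left. \frac{d\lambda_1}{da} \right|_{a^*} \propto \left. \frac{d\mathfrak{z}(g)}{dg} \right|_{g_+ = 1/a^*} = 0. \tag{14.17}$$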


Hence, one has, for a —> a* and for generic square-root singularities, 
ai =À} +Cla-a*} +0 (a = a) (14.18) 


where C is some problem dependent coefficient. 

By studying the fluctuations of $\mathbf{u}^T\mathbf{G}(\lambda)\mathbf{u}$ around $g(\lambda)$, one can show that the fluctuations
of the outlier around $\lambda_1 = R(a^{-1}) + a$ are Gaussian and of order $N^{-1/2}$. This is to
be contrasted with the fluctuations of the largest eigenvalue when there are no outliers
($a < a^* = 1/g_+$), which are Tracy-Widom and of order $N^{-2/3}$. The transition between the two
regimes is called the Baik-Ben Arous-Péché (BBP) transition.

Figure 14.4 Plot of the inverse function $\mathfrak{z}(g) = g + 1/g$ for the unit Wigner function $g(z)$. The gray
dot indicates the point $(g_+, \lambda_+)$. The line to the left of this point is the true inverse of $g(z)$: $\mathfrak{z}(g)$ is
defined on $[0, g_+)$ and is monotonically decreasing in $g$. The line to the right is a spurious solution
introduced by the R-transform. Note that the point $g = g_+$ is a minimum of $\mathfrak{z}(g) = g + 1/g$.
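The outlier position of Eq. (14.16) can be checked numerically. The following is a minimal sketch, assuming NumPy and a Gaussian (GOE-like) Wigner matrix; the function names are illustrative and do not come from the text. By rotational invariance the perturbation direction is taken to be the first basis vector.

```python
import numpy as np

def gaussian_wigner(n, sigma, rng):
    """Symmetric Gaussian matrix with off-diagonal variance sigma^2 / n."""
    h = rng.standard_normal((n, n))
    return sigma * (h + h.T) / np.sqrt(2.0 * n)

def largest_eigenvalue(n, a, sigma=1.0, seed=0):
    """Largest eigenvalue of X + a u u^T for one sample, with u = e_1 (assumption)."""
    rng = np.random.default_rng(seed)
    m = gaussian_wigner(n, sigma, rng)
    m[0, 0] += a  # rank-1 perturbation a * e_1 e_1^T
    return float(np.linalg.eigvalsh(m)[-1])

def predicted_eigenvalue(a, sigma=1.0):
    """Eq. (14.16) for a > a* = sigma, and the bulk edge 2 * sigma otherwise."""
    a_eff = max(a, sigma)
    return a_eff + sigma**2 / a_eff

if __name__ == "__main__":
    for a in (0.5, 1.0, 2.0, 4.0):
        print(a, largest_eigenvalue(400, a), predicted_eigenvalue(a))
```

For a single sample at finite N the agreement is only approximate: the deviations are expected to be of order $N^{-1/2}$ above the transition and $N^{-2/3}$ below it.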


We finish this section with two remarks. 


• One concerns the solutions to Eq. (14.14), and the way to find the value $a^*$ beyond
which an outlier appears. The point is that, while the function $R(g = 1/a)$ is well
defined for $g \in [0, g_+)$, it also often makes sense even beyond $g_+$ (see the discussion in
Section 10.4). In that case, one will find spurious solutions: Figure 14.4 shows a plot of
$\mathfrak{z}(g) = R(g) + 1/g$ in the unit Wigner case, which is still well defined for $g > g_+ = 1$
even if this function is no longer the inverse of $g(z)$ (Section 10.4). There are two
solutions to $\mathfrak{z}(g) = \lambda_1$, one such that $g < g_+$ and the other such that $g > g_+$. As
noted above, the point $g_+$ is a minimum of $\mathfrak{z}(g)$, beyond which the relation between $\lambda_1$
and $a$ is monotonically increasing because $g(z)$ is monotonically decreasing for $z > \lambda_+$.

• The second concerns the case of a free rank-$n$ perturbation, when $n \ll N$. In this case
one cannot use the Sherman-Morrison formula but one can compute the R-transform
of the perturbed matrix, and infer the $1/N$ correction to the Stieltjes transform. The
poles of this correction term give the possible outliers. To each eigenvalue $a_k$
($k = 1, \ldots, n$) of the perturbation, one can associate a candidate outlier $\lambda_k$ given by

$$\lambda_k = R\!\left(\frac{1}{a_k}\right) + a_k \quad \text{when} \quad a_k > \frac{1}{g_+}. \tag{14.19}$$


14.2.2 Outlier Eigenvectors 


The matrix resolvent in Eq. (14.7) can also tell us about the eigenvectors of the perturbed
matrix. We expect that, for a very strong rank-1 perturbation $a\mathbf{u}\mathbf{u}^T$, the eigenvector $\mathbf{v}_1$
associated with the outlier $\lambda_1$ will be very close to the perturbation vector $\mathbf{u}$. On the other
hand, for $\lambda_1 \approx \lambda_+$, the vector $\mathbf{u}$ will strongly mix with bulk eigenvectors of $\mathbf{M}$ so the
eigenvector $\mathbf{v}_1$ will not contain much information about $\mathbf{u}$.

To understand this phenomenon quantitatively, we will study the squared overlap $|\mathbf{v}_1^T\mathbf{u}|^2$.
With the spectral decomposition of $\mathbf{M} + a\mathbf{u}\mathbf{u}^T$, we can write


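$$\mathbf{G}_a(z) = \sum_{i=1}^{N} \frac{\mathbf{v}_i\mathbf{v}_i^T}{z - \lambda_i}, \tag{14.20}$$

where $\lambda_1$ denotes the outlier and $\mathbf{v}_1$ its eigenvector, and $\lambda_i, \mathbf{v}_i$, $i > 1$ all other
eigenvalues/eigenvectors. Thus we have

$$\lim_{z \to \lambda_1} \mathbf{u}^T\mathbf{G}_a(z)\mathbf{u} \cdot (z - \lambda_1) = |\mathbf{v}_1^T\mathbf{u}|^2. \tag{14.21}$$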


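Hence, by (14.7) and (14.9), we get

$$|\mathbf{v}_1^T\mathbf{u}|^2 = \lim_{z \to \lambda_1}\left(g(z) + \frac{a\,g(z)^2}{1 - a\,g(z)}\right)(z - \lambda_1) = \lim_{z \to \lambda_1} g(z)\,\frac{z - \lambda_1}{1 - a\,g(z)}. \tag{14.22}$$

We cannot simply evaluate the fraction above at $z = \lambda_1$, for at that point $g(\lambda_1) = a^{-1}$ and
we would get $0/0$. We can however use l'Hospital's rule⁴ and find

$$|\mathbf{v}_1^T\mathbf{u}|^2 = -\frac{g(\lambda_1)^2}{g'(\lambda_1)}, \tag{14.24}$$

where we have used $a^{-1} = g(\lambda_1)$. The right hand side is always positive since $g$ is a
decreasing function for $\lambda > \lambda_+$.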

We can rewrite Eq. (14.24) in terms of the R-transform and get a more useful formula.
To compute $g'(z)$, we take the derivative with respect to $z$ of the implicit equation
$z = R(g(z)) + g^{-1}(z)$ and get

$$1 = R'(g(z))\,g'(z) - \frac{g'(z)}{g^2(z)} \;\Rightarrow\; g'(z) = \frac{1}{R'(g(z)) - g^{-2}(z)}. \tag{14.25}$$

Hence we have

$$|\mathbf{v}_1^T\mathbf{u}|^2 = 1 - g(\lambda_1)^2 R'(g(\lambda_1)) = 1 - a^{-2}R'(a^{-1}). \tag{14.26}$$

We can now check our intuition about the overlap for large and small perturbations. For a
large perturbation $a \to \infty$, Eq. (14.26) gives


4 L'Hospital's rule states that

$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \frac{f'(x_0)}{g'(x_0)} \quad \text{when} \quad f(x_0) = g(x_0) = 0. \tag{14.23}$$

[Figure 14.5: plot of the squared overlap versus $a$; legend: solid line = $|\mathbf{v}_1^T\mathbf{u}|^2 = \max(1 - a^{-2}, 0)$]

Figure 14.5 Overlap between the largest eigenvector and the perturbation vector of a Gaussian
Wigner matrix with $\sigma^2 = 1$ with a rank-1 perturbation of magnitude $a$. Each dot is the overlap
for a single random matrix with $N = 200$. Equation (14.31) is plotted as the solid curve.


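$$|\mathbf{v}_1^T\mathbf{u}|^2 = 1 - \frac{\kappa_2(\mathbf{M})}{a^2} + O(a^{-3}) \quad \text{when} \quad a \to \infty. \tag{14.27}$$

As expected, $\mathbf{v}_1 \to \mathbf{u}$ when $a \to \infty$: the angle between the two vectors decreases as $1/a$.

The overlap near the transition $\lambda_1 \to \lambda_+$ can be analyzed as follows. The derivative of
$g(z)$ can be written as

$$g'(z) = -\int_{\lambda_-}^{\lambda_+} \frac{\rho(x)}{(z - x)^2}\,dx. \tag{14.28}$$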


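For a density that vanishes at the edge as $\rho(\lambda) \sim (\lambda_+ - \lambda)^\theta$ with exponent $\theta$ between 0 and
1, we have that $g(z)$ is finite at $z = \lambda_+$ but $g'(z)$ diverges at the same point, as $|z - \lambda_+|^{\theta - 1}$.
From Eq. (14.24), we have in that case⁵

$$|\mathbf{v}_1^T\mathbf{u}|^2 \propto (\lambda_1 - \lambda_+)^{1-\theta} \quad \text{when} \quad \lambda_1 \to \lambda_+. \tag{14.29}$$

In the generic case, one has $\theta = 1/2$ and, from Eq. (14.18), $\lambda_1 - \lambda_+ \propto (a - a^*)^2$, leading to

$$|\mathbf{v}_1^T\mathbf{u}|^2 \propto a - a^* \quad \text{when} \quad a \to a^*. \tag{14.30}$$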


These general results are nicely illustrated by Wigner matrices, for which $R(x) = \sigma^2 x$.
The overlap is explicitly given by (see Fig. 14.5)

$$|\mathbf{v}_1^T\mathbf{u}|^2 = 1 - \left(\frac{a^*}{a}\right)^2 \quad \text{when} \quad a > a^* = \sigma. \tag{14.31}$$


5 In Chapter 5, we encountered a critical density where $\rho(\lambda)$ behaves as $(\lambda_+ - \lambda)^\theta$ with an exponent $\theta = 3/2 > 1$. In this case
$g'(z)$ does not diverge as $z \to \lambda_+$ and the squared overlap at the edge of the BBP transition does not go to zero (first order
transition). For example for the density given by Eq. (5.59) we find $|\mathbf{v}_1^T\mathbf{u}|^2 = 4/9$ at the edge $\lambda_1 = 2\sqrt{2}$.

As $a \to a^*$, $\lambda_1 \to 2\sigma$ and $|\mathbf{v}_1^T\mathbf{u}|^2 \approx 2(a - a^*)/a^* \to 0$: the eigenvector becomes
delocalized as the eigenvalue merges with the bulk. For $a < a^*$, one can rigorously show
that there is no information left in the eigenvalues of the perturbed matrix that would allow
us to reconstruct $\mathbf{u}$.

Note that for $\lambda_1 > \lambda_+$, $|\mathbf{v}_1^T\mathbf{u}|^2$ is of order unity. In Chapter 19 we will see that this is
not the case for the overlaps between perturbed and unperturbed eigenvectors in the bulk,
which have typical sizes of order $N^{-1}$.
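The Wigner overlap formula (14.31) lends itself to a quick numerical check. This is a hedged sketch, assuming NumPy and a Gaussian Wigner matrix with the perturbation along the first basis vector; `squared_overlap` and `predicted_overlap` are illustrative names, not from the text.

```python
import numpy as np

def squared_overlap(n, a, sigma=1.0, seed=0):
    """Squared overlap between the top eigenvector of X + a e_1 e_1^T and e_1."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n, n))
    m = sigma * (h + h.T) / np.sqrt(2.0 * n)  # Gaussian Wigner matrix
    m[0, 0] += a                              # rank-1 perturbation along e_1
    _, vecs = np.linalg.eigh(m)               # eigenvalues in ascending order
    return float(vecs[0, -1] ** 2)            # first component of the top eigenvector

def predicted_overlap(a, sigma=1.0):
    """Eq. (14.31) above a* = sigma, zero below the transition."""
    return max(1.0 - (sigma / a) ** 2, 0.0)

if __name__ == "__main__":
    for a in (0.5, 1.5, 3.0):
        print(a, squared_overlap(400, a), predicted_overlap(a))
```

Below the transition the measured overlap is not exactly zero at finite N; it is expected to be small, of the order of the typical bulk overlaps.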


Exercise 14.2.1 Additive perturbation of a Wishart matrix
Define a modified Wishart matrix $\mathbf{W}_1$ such that every element $(\mathbf{W}_1)_{ij} = \mathbf{W}_{ij} + a/N$,
where $\mathbf{W}$ is a standard Wishart matrix and $a$ is a constant of order 1.
$\mathbf{W}_1$ is a standard Wishart matrix plus a rank-1 perturbation $\mathbf{W}_1 = \mathbf{W} + a\mathbf{u}\mathbf{u}^T$.

(a) What is the normalized vector $\mathbf{u}$ in this case?

(b) Using Eqs. (14.14) and (10.15) find the value of the outlier and the minimal $a$
in the Wishart case.

(c) The square-overlap between the vector $\mathbf{u}$ and the new eigenvector $\mathbf{v}_1$ is given
by Eq. (14.26). Give an explicit expression in the Wishart case.

(d) Generate a large modified Wishart ($q = 1/4$, $N = 1000$) for a few $a$ in the
range $[1, 5]$. Compute the largest eigenvalue $\lambda_1$ and associated eigenvector $\mathbf{v}_1$.
Plot $\lambda_1$ and $|\mathbf{v}_1^T\mathbf{u}|^2$ as a function of $a$ and compare with the predictions of (b)
and (c).


14.3 Fat Tails 


The previous section allows us to discuss the very interesting situation of real symmetric
random matrices $\mathbf{X}$ with iid elements $X_{ij}$ that have a fat-tailed distribution, but with a finite
variance (the case of infinite variance will be alluded to at the end of the section). In order
to have eigenvalues of order unity, the random elements must be of typical size $N^{-1/2}$,
so we write

$$X_{ij} = \frac{x_{ij}}{\sqrt{N}}, \tag{14.32}$$

with $x_{ij}$ distributed according to some density $P(x)$ of mean zero and variance unity, but
that decays as $\mu|x|^{-1-\mu}$ for large $x$. This means that most elements $X_{ij}$ are small, of
order $N^{-1/2}$, with some exceptional elements that are of order unity. The probability that
$|X_{ij}| > 1$ is actually given by

$$\mathbb{P}(|X_{ij}| > 1) \approx 2\int_{\sqrt{N}}^{\infty} \frac{\mu\,dx}{x^{1+\mu}} = \frac{2}{N^{\mu/2}}. \tag{14.33}$$

Since there are in total $N(N-1)/2 \approx N^2/2$ such random variables, the total number of
such variables that exceed unity is given by $N^{2-\mu/2}$. Hence, for $\mu > 4$, this number tends
to zero with $N$: there are typically no such large elements in the considered random matrix.
Since each pair of large entries $X_{ij} = X_{ji}$ can be considered as a rank-2 perturbation of a
matrix with all elements of order $N^{-1/2}$, one concludes that for $\mu > 4$ there are no outliers,
and the statistics of the largest eigenvalue is given by the Tracy-Widom distribution around
$\lambda_+ = 2$. This hand-waving argument can actually be made rigorous: in the large $N$ limit,
the finiteness of the fourth moment of the distribution of matrix elements is sufficient to
ensure that the largest eigenvalue is given by the Tracy-Widom distribution. However, one
should be careful in interpreting this result, because although very large elements appear
with vanishing probability, they still dominate the tail of the Tracy-Widom distribution for
finite $N$. The reason is that, whereas the former decreases as $N^{2-\mu/2}$, the latter decreases
much faster, as $\exp(-N(\lambda_{\max} - \lambda_+)^{3/2})$ (see Fig. 14.6).

Figure 14.6 Distribution of the largest eigenvalue of $N = 1000$ Wigner matrices with elements
drawn from a distribution with $\mu = 5$ compared with the prediction from the Tracy-Widom
distribution. Even though as $N \to \infty$ this distribution converges to Tracy-Widom, at $N = 1000$
there is no agreement between the laws as the power law tail $\mathbb{P}(\lambda_{\max} > x) \sim x^{-5}/\sqrt{N}$ still
dominates.
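The counting argument for large entries can be illustrated with a small simulation. The sketch below is an assumption-laden toy: it draws the absolute values from a pure Pareto law with $\mathbb{P}(|x| > s) = s^{-\mu}$ for $s \ge 1$ (not the unit-variance density $P(x)$ of the text), so the expected count is $N^{2-\mu/2}/2$, which differs from the estimate above only by an order-one prefactor. The function names are illustrative.

```python
import numpy as np

def count_large_entries(n, mu, seed=0):
    """Number of independent entries with |X_ij| = |x_ij| / sqrt(n) > 1."""
    rng = np.random.default_rng(seed)
    n_pairs = n * (n - 1) // 2
    u = 1.0 - rng.random(n_pairs)     # uniform on (0, 1], avoids division by zero
    abs_x = u ** (-1.0 / mu)          # Pareto: P(|x| > s) = s^(-mu) for s >= 1
    return int(np.count_nonzero(abs_x / np.sqrt(n) > 1.0))

def expected_large_entries(n, mu):
    """Expected count for the Pareto toy model: n (n - 1) / 2 * n^(-mu / 2)."""
    return n * (n - 1) / 2.0 * n ** (-mu / 2.0)

if __name__ == "__main__":
    for mu in (2.5, 3.0, 5.0):
        print(mu, count_large_entries(2000, mu), expected_large_entries(2000, mu))
```

For $\mu > 4$ the expected count decreases with $N$, while for $2 < \mu < 4$ it grows, but more slowly than $N$.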

Now consider the case $2 < \mu < 4$. Since $\mu > 2$, the variance of $X_{ij}$ is finite and
one knows that the asymptotic distribution of eigenvalues of $\mathbf{X}$ is given by the Wigner
semi-circle, with $\lambda_+ = 2$. But now the number of large entries in the matrix $\mathbf{X}$ grows
with $N$ as $N^{2-\mu/2}$, which is nevertheless still much smaller than $N$. Each large pair of
entries $X_{ij} = X_{ji} = a$ larger than unity (in absolute value) contributes to two outliers,
given by $\lambda = \pm(a + 1/a)$. So there are in total $O(N^{2-\mu/2})$ outliers, the density of which
is given by

$$\rho_{\mathrm{out}}(\lambda > 2) = N^{1-\mu/2}\int_1^{\infty} dx\, \frac{\mu}{x^{1+\mu}}\, \delta\!\left(\lambda - x - \frac{1}{x}\right). \tag{14.34}$$

This is a rather strange situation where the density of outliers goes to zero as $N \to \infty$ as
soon as $\mu > 2$, but at the same time the largest eigenvalue in this problem goes to infinity as

$$\lambda_{\max} \sim N^{\frac{2}{\mu} - \frac{1}{2}}, \qquad 2 < \mu < 4. \tag{14.35}$$

Finally, let us briefly comment on the case $\mu < 2$, for which the variance of the entries
of $\mathbf{X}$ diverges (a case called “Lévy matrices” in the literature). For eigenvalues to remain
of order unity, one needs to scale the matrix elements differently, as

$$X_{ij} = \frac{x_{ij}}{N^{1/\mu}}, \tag{14.36}$$

with

$$P(x) \underset{|x| \to \infty}{\sim} \frac{\Gamma(1+\mu)\sin\!\left(\frac{\pi\mu}{2}\right)}{\pi\,|x|^{1+\mu}}, \tag{14.37}$$

where the funny factor involving the gamma function is introduced for convenience only.
The eigenvalue distribution is no longer given by a semi-circle. In fact, the support of the
distribution is in this case unbounded. For completeness, we give here the exact expression
of the distribution in terms of Lévy stable laws $L_\mu^{C,\beta}(u)$, where $\beta$ is called the asymmetry
parameter and $C$ the scale parameter.⁶ For a given value of $\lambda$, one should first solve the
following two self-consistent equations for $C$ and $\beta$:

$$\begin{aligned}
C &= \int_{-\infty}^{+\infty} dg\, |g|^{\frac{\mu}{2}-2}\, L_{\mu/2}^{C,\beta}(\lambda - 1/g), \\
C\beta &= \int_{-\infty}^{+\infty} dg\, \operatorname{sign}(g)\,|g|^{\frac{\mu}{2}-2}\, L_{\mu/2}^{C,\beta}(\lambda - 1/g).
\end{aligned} \tag{14.38}$$

Finally, the distribution of eigenvalues $\rho_L(\lambda)$ of Lévy matrices is obtained as

$$\rho_L(\lambda) = L_{\mu/2}^{C(\lambda),\beta(\lambda)}(\lambda). \tag{14.39}$$

One can check in particular that this distribution decays for large $\lambda$ exactly as $P(x)$
itself. In other words, the tail of the eigenvalue distribution is the same as the tail of the
independent entries of the Lévy matrix.


14.4 Multiplicative Perturbation 


In data-analysis applications, we often need to understand the largest eigenvalue of a sample 
covariance matrix. A true covariance with a few isolated eigenvalues can be treated as a 
matrix $\mathbf{C}_0$ with no isolated eigenvalue plus a low-rank perturbation. The passage from the
true covariance to the sample covariance is equivalent to the free product of the true
covariance with a white Wishart matrix with appropriate aspect ratio $q = N/T$. To understand
such matrices, we will now study outliers for a multiplicative process.

Consider the free product of a certain covariance matrix $\mathbf{C}_0$ with a rank-1 perturbation
and another matrix $\mathbf{B}$:

$$\mathbf{E} = \mathbf{B}^{\frac12}\mathbf{C}_0^{\frac12}\left(\mathbf{1} + a\,\mathbf{u}\mathbf{u}^T\right)\mathbf{C}_0^{\frac12}\mathbf{B}^{\frac12}, \tag{14.40}$$


6 More precisely, $L_\mu^{C,\beta}(x)$ is the Fourier transform of $\exp(-C|k|^\mu(1 + i\beta\,\operatorname{sign}(k)\tan(\pi\mu/2)))$ for $\mu \neq 1$, and of
$\exp(-C|k|(1 + i(2\beta/\pi)\operatorname{sign}(k)\log|k|))$ for $\mu = 1$.

where $\mathbf{u}$ is a normalized eigenvector of $\mathbf{C}_0$ with eigenvalue $\lambda_0$, and $\mathbf{B}$ is positive
semi-definite, free from $\mathbf{C}_0$, and with $\tau(\mathbf{B}) = 1$. In the special case where $\mathbf{B}$ is a white Wishart,
our problem corresponds to a noisy observation of a perturbed covariance matrix, where
one of the modes (the one corresponding to $\mathbf{u}$) has a variance boosted by a factor $(1 + a)$.

The matrix $\mathbf{E}_0 := \mathbf{B}^{\frac12}\mathbf{C}_0\mathbf{B}^{\frac12}$ has an unperturbed spectrum with a lower edge $\lambda_-$ and an
upper edge $\lambda_+$. We want to establish, as in the additive case, the condition for the existence
of an outlier $\lambda_1 > \lambda_+$ or $\lambda_1 < \lambda_-$, and the exact position of the outlier when it exists.

The eigenvalues of $\mathbf{E}$ are the zeros of its characteristic polynomial, in particular for the
largest eigenvalue $\lambda_1$ we have

$$\det\!\left(\lambda_1\mathbf{1} - \mathbf{E}_0 - a\,\mathbf{B}^{\frac12}\mathbf{C}_0^{\frac12}\mathbf{u}\mathbf{u}^T\mathbf{C}_0^{\frac12}\mathbf{B}^{\frac12}\right) = 0. \tag{14.41}$$


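We are looking for an eigenvalue outside the spectrum of $\mathbf{E}_0$, i.e. $\lambda_1 > \lambda_+$ or $\lambda_1 < \lambda_-$. For
such a $\lambda_1$, the matrix $\lambda_1\mathbf{1} - \mathbf{E}_0$ is invertible and we can use the matrix determinant lemma
Eq. (1.30):

$$\det(\mathbf{A} + \mathbf{u}\mathbf{v}^T) = \det\mathbf{A} \cdot \left(1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}\right), \tag{14.42}$$

with $\mathbf{A} = \lambda_1\mathbf{1} - \mathbf{E}_0$ and $\mathbf{u} = -\mathbf{v} = \sqrt{a}\,\mathbf{B}^{\frac12}\mathbf{C}_0^{\frac12}\mathbf{u}$. Equation (14.41) becomes

$$\det(\lambda_1\mathbf{1} - \mathbf{E}_0) \cdot \left(1 - a\,\mathbf{u}^T\mathbf{C}_0^{\frac12}\mathbf{B}^{\frac12}\mathbf{G}_0(\lambda_1)\mathbf{B}^{\frac12}\mathbf{C}_0^{\frac12}\mathbf{u}\right) = 0, \tag{14.43}$$

where we have introduced the matrix resolvent $\mathbf{G}_0(\lambda_1) := (\lambda_1\mathbf{1} - \mathbf{E}_0)^{-1}$. As we said, the
matrix $\lambda_1\mathbf{1} - \mathbf{E}_0$ is invertible so its determinant is non-zero. Thus any outlier needs to solve

$$a\lambda_0\,\mathbf{u}^T\mathbf{B}^{\frac12}\mathbf{G}_0(\lambda_1)\mathbf{B}^{\frac12}\mathbf{u} = 1. \tag{14.44}$$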


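Again we assume that $\mathbf{B}$ is a rotationally invariant matrix with respect to $\mathbf{C}_0$. Then we
know that in the large $N$ limit $\mathbf{G}_0(z)$ is diagonal in the basis of $\mathbf{B}$ and reads (see, mutatis
mutandis, Eq. (13.47))

$$\mathbf{G}_0(z) \approx S^*(z)\,\mathbf{G}_{\mathbf{B}}(zS^*(z)), \qquad S^*(z) := S_{\mathbf{C}_0}(zg_0(z) - 1). \tag{14.45}$$

Furthermore, since $\mathbf{u}$ and $\mathbf{B}$ are also free,

$$\mathbf{u}^T\mathbf{B}^{\frac12}\mathbf{G}_0(\lambda_1)\mathbf{B}^{\frac12}\mathbf{u} \approx N^{-1}\operatorname{Tr}\!\left(\mathbf{B}^{\frac12}\mathbf{G}_0(\lambda_1)\mathbf{B}^{\frac12}\right) = S^*(\lambda_1)\,t_{\mathbf{B}}(\lambda_1S^*(\lambda_1)), \tag{14.46}$$

where we have recognized the T-transform of the matrix $\mathbf{B}$:

$$t_{\mathbf{B}}(\zeta) = \tau\!\left(\mathbf{B}\,(\zeta\mathbf{1} - \mathbf{B})^{-1}\right). \tag{14.47}$$

Thus, the position of the outlier $\lambda_1$ is given by the solution of

$$a\lambda_0\,S^*(\lambda_1)\,t_{\mathbf{B}}(\lambda_1S^*(\lambda_1)) = 1. \tag{14.48}$$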


In order to keep the calculation simple, we now assume that $\mathbf{C}_0 = \mathbf{1}$. In this case, $S^* = 1$
and $\lambda_0 = 1$, so the equation simplifies to

$$a\,t_{\mathbf{B}}(\lambda_1) = 1. \tag{14.49}$$

To know whether this equation has a solution we need to know if $t_{\mathbf{B}}(\zeta)$ is invertible. The
argument is very similar to the one for $g(z)$ in the additive case. In the large $N$ limit, $t_{\mathbf{B}}(\zeta)$
converges to

$$t_{\mathbf{B}}(\zeta) = \int_{\lambda_-}^{\lambda_+} \frac{\rho_{\mathbf{B}}(x)\,x}{\zeta - x}\,dx. \tag{14.50}$$

So $t_{\mathbf{B}}(\zeta)$ is monotonically decreasing for $\zeta > \lambda_+$ and is therefore invertible. We then have

$$\lambda_1 = \zeta(a^{-1}) \quad \text{when} \quad \lambda_1 > \lambda_+, \tag{14.51}$$

where we use the notation $\zeta(t)$ for the inverse of the T-transform of $\mathbf{B}$, in the region where
it is invertible.

The inverse function $\zeta(t)$ can be expressed in terms of the S-transform via Eq. (11.92).
We get

$$\lambda_1 = \zeta(a^{-1}) = \frac{a + 1}{S_{\mathbf{B}}(a^{-1})} \quad \text{when} \quad a > \frac{1}{t_{\mathbf{B}}(\lambda_+)}. \tag{14.52}$$

Applying the theory to a Wishart matrix $\mathbf{B} = \mathbf{W}$ with

$$S_{\mathbf{W}}(t) = \frac{1}{1 + qt}, \qquad \lambda_\pm = (1 \pm \sqrt{q})^2, \tag{14.53}$$

one finds that an outlier appears to the right of $\lambda_+$ for $a > \sqrt{q}$, with

$$\lambda_1 = (a + 1)\left(1 + \frac{q}{a}\right). \tag{14.54}$$

For large $a$, we have $\lambda_1 \approx a + 1 + q$, i.e. a large eigenvalue $a + 1$ in the covariance matrix
$\mathbf{C}$ will appear shifted by $q$ in the eigenvalues of the sample covariance matrix.

Nothing prevents us from considering negative values of $a$, such that $a > -1$ to preserve
the positive definite nature of $\mathbf{C}$. In this case, an outlier appears to the left of $\lambda_-$ when
$a < -\sqrt{q}$. Its position is given by the very same equation (14.54) as above.
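Equation (14.54) can be compared with the top eigenvalue of a simulated sample covariance matrix. The following is a minimal sketch, assuming NumPy, Gaussian data, and a true covariance equal to the identity except for one direction (taken to be the first coordinate) whose variance is $1 + a$; the function names are illustrative.

```python
import numpy as np

def top_sample_eigenvalue(n, t, a, seed=0):
    """Largest eigenvalue of the sample covariance of t Gaussian observations
    whose true covariance is 1 + a e_1 e_1^T (assumed spike direction e_1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, t))
    x[0] *= np.sqrt(1.0 + a)          # boost the variance of the first mode
    e = x @ x.T / t                   # sample covariance, aspect ratio q = n / t
    return float(np.linalg.eigvalsh(e)[-1])

def predicted_outlier(a, q):
    """Eq. (14.54) for a > sqrt(q); otherwise the Wishart upper edge."""
    if a > np.sqrt(q):
        return (a + 1.0) * (1.0 + q / a)
    return (1.0 + np.sqrt(q)) ** 2

if __name__ == "__main__":
    n, t = 300, 1200
    for a in (0.2, 1.0, 2.0, 4.0):
        print(a, top_sample_eigenvalue(n, t, a), predicted_outlier(a, n / t))
```

The matrix built here is $\mathbf{C}^{1/2}\mathbf{W}\mathbf{C}^{1/2}$ rather than $\mathbf{W}^{1/2}\mathbf{C}\mathbf{W}^{1/2}$; both have the same eigenvalues, so the comparison with (14.54) is unaffected.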


Exercise 14.4.1 Transpose version of multiplicative perturbation
Consider a positive definite rotationally invariant random matrix $\mathbf{B}$ and a
normalized vector $\mathbf{u}$. In this exercise, we will show that the matrix $\mathbf{F}$ defined by

$$\mathbf{F} = (\mathbf{1} + c\,\mathbf{u}\mathbf{u}^T)\,\mathbf{B}\,(\mathbf{1} + c\,\mathbf{u}\mathbf{u}^T), \tag{14.55}$$

with $c > 0$ sufficiently large, has an outlier $\lambda_1$ given by Eq. (14.52) with
$a + 1 = (c + 1)^2$.

(a) Show that for two positive definite matrices $\mathbf{A}$ and $\mathbf{B}$, $\mathbf{B}^{\frac12}\mathbf{A}\mathbf{B}^{\frac12}$ has the same
eigenvalues as $\mathbf{A}^{\frac12}\mathbf{B}\mathbf{A}^{\frac12}$.

(b) Show that for a normalized vector $\mathbf{u}$

$$\left(\mathbf{1} + (a - 1)\,\mathbf{u}\mathbf{u}^T\right)^{\frac12} = \mathbf{1} + (\sqrt{a} - 1)\,\mathbf{u}\mathbf{u}^T. \tag{14.56}$$

(c) Finish the proof of the above statement.


Exercise 14.4.2 Multiplicative perturbation of an inverse-Wishart matrix
We will see in Section 15.2.3 that the inverse-Wishart matrix $\mathbf{M}_p$ is defined as

$$\mathbf{M}_p = (1 - q)\,\mathbf{W}_q^{-1}, \tag{14.57}$$

where $\mathbf{W}_q$ is a Wishart matrix with parameter $q$, and $p$, the variance of the
inverse-Wishart, is given by $p = \frac{q}{1-q}$. The S-transform of $\mathbf{M}_p$ is given by

$$S_{\mathbf{M}_p}(t) = 1 - pt. \tag{14.58}$$

Consider the diagonal matrix $\mathbf{D}$ with $\mathbf{D}_{11} = d$ and all other diagonal entries
equal to 1.

(a) $\mathbf{D}$ can be written as $\mathbf{1} + c\,\mathbf{u}\mathbf{u}^T$. What is the normalized vector $\mathbf{u}$ and the
constant $c$?

(b) Using the result from Exercise 14.4.1, find the value of the largest eigenvalue
of the matrix $\mathbf{D}\mathbf{M}_p\mathbf{D}$ as a function of $d$. Note that your expression will only
be valid for sufficiently large $d$.

(c) Numerically generate matrices $\mathbf{M}_p$ with $N = 1000$ and $p = 1/2$ ($q = 1/3$).
Find the largest eigenvalue of $\mathbf{D}\mathbf{M}_p\mathbf{D}$ for various values of $d$ and make a plot
of $\lambda_1$ vs $d$. Superimpose your analytical result.

(d) (Harder) Find analytically the minimum value of $d$ to have an outlier $\lambda_1$.


14.5 Phase Retrieval and Outliers 


Optical detection devices like CCD cameras or photosensitive films measure the photon 
flux but are blind to the phase of the incoming light. More generally, it is often the case 
that one can only measure the power spectral density of a signal, which is the magnitude 
of its Fourier transform. Can one recover the full signal based on this partial information? 
This problem is called phase retrieval and can be framed mathematically as follows. Let an
unknown vector $\mathbf{x} \in \mathbb{R}^N$ be “probed” with $T$ vectors $\mathbf{a}_k$, in the sense that the measurement
apparatus gives us $y_k = |\mathbf{a}_k^T\mathbf{x}|^2$ with $k = 1, \ldots, T$.⁷ Vectors $\mathbf{x}$ and $\mathbf{a}_k$ are taken to be real
but they can easily be made complex.

The phase retrieval problem is

$$\hat{\mathbf{x}} = \underset{\mathbf{x}}{\operatorname{argmin}} \sum_k \left(|\mathbf{a}_k^T\mathbf{x}|^2 - y_k\right)^2. \tag{14.59}$$

7 We could consider that there is some additional noise in the measurement of $y_k$ but for simplicity we stick here to the
noiseless version.

It is a difficult non-convex optimization problem with many local minima. To efficiently
find an acceptable solution, we need a starting point $\mathbf{x}_0$ that somehow points in the
direction of the true solution $\mathbf{x}$. The problem is that in large dimensions the probability that a
random vector $\mathbf{x}_0$ has an overlap $|\mathbf{x}^T\mathbf{x}_0| > \varepsilon$ is exponentially small in $N$ as soon as $\varepsilon > 0$.
We will explore here a technique that allows one to find a vector $\mathbf{x}_0$ with non-vanishing
overlap with the true $\mathbf{x}$.

The idea is to build some sort of weighted Wishart matrix such that this matrix will have 
an outlier with non-zero overlap with the unknown true vector x. Consider the following 
matrix: 


$$\mathbf{M} = \frac{1}{T}\sum_{k=1}^{T} f(y_k)\,\mathbf{a}_k\mathbf{a}_k^T, \tag{14.60}$$

where the $T$ vectors $\mathbf{a}_k$ are of size $N$ and $f(y)$ is a function that we will choose later.
The function $f(y)$ should be bounded above, otherwise we might have outliers dominated
by a few large values of $f(y_k)$. One such function that we will study is the simple
threshold $f(y) := \Theta(y - 1)$. In large dimensions the results should not depend
on the precise statistics of the vectors $\mathbf{a}_k$ provided they are sufficiently random. Here
we will assume that all their components are standard iid Gaussian. This assumption
makes the problem invariant by rotation. Without loss of generality, we can assume the
true vector $\mathbf{x}$ is in the canonical direction $\mathbf{e}_1$. The weights $f(y_k)$ are therefore assumed
to be correlated to $|[\mathbf{a}_k]_1|^2 = |\mathbf{a}_k^T\mathbf{e}_1|^2$ and independent of all other components of the
vectors $\mathbf{a}_k$.

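Given that the first row and column of $\mathbf{M}$ contain the element $[\mathbf{a}_k]_1$, we write the matrix
in block form as in Section 1.2.5:

$$\mathbf{M} = \begin{pmatrix} M_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{21} & \mathbf{M}_{22} \end{pmatrix}, \tag{14.61}$$

with the (11) block of size $1 \times 1$ and the (22) block of size $(N - 1) \times (N - 1)$. To find a
potential outlier, we look for the poles of the Stieltjes transform $g_{\mathbf{M}}(z) = \tau\left((z\mathbf{1} - \mathbf{M})^{-1}\right)$.
Combining Eqs. (1.32) and (1.33),

$$N g_{\mathbf{M}}(z) = \operatorname{Tr}\mathbf{G}_{22}(z) + \frac{1 + \operatorname{Tr}\left[\mathbf{G}_{22}(z)\mathbf{M}_{21}\mathbf{M}_{12}\mathbf{G}_{22}(z)\right]}{z - M_{11} - \mathbf{M}_{12}\mathbf{G}_{22}(z)\mathbf{M}_{21}}, \tag{14.62}$$

where $\mathbf{G}_{22}(z)$ is the matrix resolvent of the rotationally invariant matrix $\mathbf{M}_{22}$ (i.e. the
matrix $\mathbf{M}$ without its first row and column). In the large $N$ limit we expect $\mathbf{M}_{22}$ to
have a continuous spectrum with an edge $\lambda_+$. When the condition for the denominator
to vanish, i.e.

$$\lambda_1 - M_{11} - \mathbf{M}_{12}\mathbf{G}_{22}(\lambda_1)\mathbf{M}_{21} = 0, \tag{14.63}$$

has a solution for $\lambda_1 > \lambda_+$ we can say that the matrix $\mathbf{M}$ has an outlier. The overlap
between the corresponding eigenvector $\mathbf{v}_1$ and $\mathbf{x}$ is given by the residue

$$\varrho := |\mathbf{v}_1^T\mathbf{e}_1|^2 = \lim_{z \to \lambda_1} \frac{z - \lambda_1}{z - M_{11} - \mathbf{M}_{12}\mathbf{G}_{22}(z)\mathbf{M}_{21}}. \tag{14.64}$$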


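In the large $N$ limit the scalar equation (14.63) becomes self-averaging. We have

$$M_{11} = \frac{1}{T}\sum_{k=1}^{T} f(y_k)\left([\mathbf{a}_k]_1\right)^2 \;\xrightarrow{T \to \infty}\; \mathbb{E}\left[f(y)\left([\mathbf{a}]_1\right)^2\right]. \tag{14.65}$$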




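For the second term we have

$$\begin{aligned}
\mathbf{M}_{12}\mathbf{G}_{22}(z)\mathbf{M}_{21} &= \frac{1}{T^2}\sum_{k,\ell=1}^{T} f(y_k)[\mathbf{a}_k]_1\, f(y_\ell)[\mathbf{a}_\ell]_1 \sum_{i,j>1} [\mathbf{a}_k]_i\,[\mathbf{G}_{22}(z)]_{ij}\,[\mathbf{a}_\ell]_j \\
&\xrightarrow{T \to \infty}\; q\,\mathbb{E}\left[f^2(y)\left([\mathbf{a}]_1\right)^2\right] h(z),
\end{aligned} \tag{14.66}$$

where $q = N/T$ and

$$h(z) = \tau\!\left(\frac{\mathbf{H}\mathbf{H}^T}{T}\,\mathbf{G}_{22}(z)\right), \qquad [\mathbf{H}]_{ik} = [\mathbf{a}_k]_i, \quad i > 1. \tag{14.67}$$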

We can now put everything together and use l'Hospital's rule to compute the residue. For
convenience we define the constants $c_n := \mathbb{E}\left[f^n(y)\left([\mathbf{a}]_1\right)^2\right]$. There will be an outlier
with overlap

$$\varrho = \frac{1}{1 - q\,c_2\,h'(\lambda_1)} \tag{14.68}$$

when there is a solution $\lambda_1 > \lambda_+$ of

$$\lambda_1 = c_1 + q\,c_2\,h(\lambda_1). \tag{14.69}$$


We will come back later to the computation of $h(z)$. In the $q \to 0$ limit the matrix $\mathbf{M}$
becomes proportional to the identity $\mathbf{M} = \mathbb{E}[f(y)]\mathbf{1} := m_1\mathbf{1}$, so $g_{\mathbf{M}}(z) = 1/(z - m_1)$ and
$h(z) = 1/(z - m_1)$. For $q = 0$ we have a solution $\lambda_1 = c_1$ which satisfies $c_1 > m_1$. In
this limit the overlap tends to one. The linear correction in $q$ is easily obtained as we only
need $h(z)$ to order zero. We obtain

$$\lambda_1 = c_1 + q\,\frac{c_2}{c_1 - m_1} + O(q^2), \qquad \varrho = 1 - q\,\frac{c_2}{(c_1 - m_1)^2} + O(q^2). \tag{14.70}$$

Note that $c_2/(c_1 - m_1)^2$ is always positive so the overlap decreases with $q$ starting from
$\varrho = 1$ at $q = 0$. For the unit thresholding function $f(y) = \Theta(y - 1)$ we have
$m_1 = \operatorname{erfc}(1/\sqrt{2}) \approx 0.317$ and $c_1 = c_2 = m_1 + \sqrt{2/(e\pi)} \approx 0.801$ (see Fig. 14.7).

Since we have the freedom to choose any bounded function $f(y)$ we should choose
the one that gives the largest overlap for the value of $q$ given by our dataset. We will do an
easier computation, namely minimize the slope of the linear approximation in $q$. We want

$$f_{\mathrm{opt}}(y) = \underset{f(y)}{\operatorname{argmin}}\, \frac{c_2}{(c_1 - m_1)^2} = \underset{f(y)}{\operatorname{argmin}}\, \frac{\mathbb{E}_a\left[f^2(a^2)\,a^2\right]}{\mathbb{E}_a\left[f(a^2)(a^2 - 1)\right]^2}, \tag{14.71}$$

where the law of $a$ is $\mathcal{N}(0, 1)$. A variational minimization gives

$$f_{\mathrm{opt}}(y) = 1 - \frac{1}{y}. \tag{14.72}$$

The optimal function is not bounded below and therefore the distribution of eigenvalues is
singular with $c_2 \to \infty$ and $m_1 \to -\infty$. One should think of this function as the limit of a
series of functions such that $c_2/(c_1 - m_1)^2 \to 0$. In Figure 14.7 we see that numerically
this function has indeed an overlap as a function of $q$ with zero slope at the origin. As a
consequence it has non-zero overlap for much greater values of $q$ (fewer data $T$) than the
simple thresholding function. It turns out that our small $q$ optimum $f(y) = 1 - 1/y$ is
actually the optimal function for all values of $q$.

[Figure 14.7: plot of the overlap $\varrho$ versus $q$; legend: $f(y) = 1 - 1/y$, $f(y) = \Theta(y - 1)$, linear approx.]

Figure 14.7 Overlap $\varrho := |\mathbf{v}_1^T\mathbf{x}|^2 / |\mathbf{x}|^2$ between the largest eigenvector and the true signal as a
function of $q := N/T$ for two functions: the simple $f(y) = \Theta(y - 1)$ and the optimal
$f(y) = 1 - 1/y$. Each dot corresponds to a single matrix of aspect ratio $q$ and $NT = 10^6$. The solid line
corresponds to the linear $q$ approximation Eq. (14.70) in the thresholding case. For the optimal case,
the slope at the origin is zero.
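The spectral construction of Eq. (14.60) with the threshold function can be tried out directly. This is a hedged sketch, assuming NumPy, noiseless measurements, iid Gaussian probe vectors and a unit-norm signal; the function names are illustrative. It compares the measured overlap with the linear-in-$q$ approximation of Eq. (14.70).

```python
import math
import numpy as np

def spectral_overlap(n, t, seed=0):
    """Squared overlap between the top eigenvector of M (Eq. 14.60, threshold f)
    and a random unit-norm signal x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                 # unit-norm true signal (assumption)
    a = rng.standard_normal((t, n))        # row k is the probe vector a_k
    y = (a @ x) ** 2                       # phaseless measurements y_k
    w = (y > 1.0).astype(float)            # f(y) = Theta(y - 1)
    m = (a * w[:, None]).T @ a / t         # M = (1/T) sum_k f(y_k) a_k a_k^T
    _, vecs = np.linalg.eigh(m)
    return float(np.dot(vecs[:, -1], x) ** 2)

def linear_overlap_threshold(q):
    """Linear approximation 1 - q c_2 / (c_1 - m_1)^2 for the threshold function."""
    m1 = math.erfc(1.0 / math.sqrt(2.0))
    c = m1 + math.sqrt(2.0 / (math.e * math.pi))   # c_1 = c_2
    return 1.0 - q * c / (c - m1) ** 2

if __name__ == "__main__":
    n, t = 100, 2000
    print(spectral_overlap(n, t), linear_overlap_threshold(n / t))
```

The linear formula is only the first term of an expansion in $q$, so at moderate $q$ the simulated overlap is expected to deviate from it.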


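For completeness we show here how to compute the function $h(z)$. We have the
following subordination relation (for the (22) block of the relevant matrices):

$$t_{\mathbf{M}}(\zeta) = t_{\mathbf{W}_q}\!\left(S_f(q\,t_{\mathbf{M}}(\zeta))\,\zeta\right), \tag{14.73}$$

where $S_f(t)$ is the S-transform of a diagonal matrix with entries $f(y_k)$. We then have

$$h(z) = \tau\left(\mathbf{W}_q\mathbf{G}_{22}(z)\right) = S\,\tau\left[(Sz\mathbf{1} - \mathbf{W}_q)^{-1}\mathbf{W}_q\right] = S\,t_{\mathbf{W}_q}(Sz) = S\,(z g_{\mathbf{M}}(z) - 1), \tag{14.74}$$

with $S := S_f(q(z g_{\mathbf{M}}(z) - 1))$. Since

$$S_{\mathbf{M}}(t) = \frac{S_f(qt)}{1 + qt} = \frac{t + 1}{t\,\zeta_{\mathbf{M}}(t)}, \tag{14.75}$$

we have

$$h(z) = g_{\mathbf{M}}(z)\left[1 + q\,(z g_{\mathbf{M}}(z) - 1)\right], \tag{14.76}$$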

where the function $g_{\mathbf{M}}(z)$ can be obtained by inverting the relation

$$\mathfrak{z}(g) = \int_{-\infty}^{+\infty} \frac{da}{\sqrt{2\pi}}\, e^{-a^2/2}\, \frac{f(a^2)}{1 - q\,g\,f(a^2)} + \frac{1}{g}. \tag{14.77}$$

Part III
Applications 


15 
Addition and Multiplication: Recipes and Examples 


In the second part of this book, we have built the necessary tools to compute the spectrum 
of sums and products of free random matrices. In this chapter we will review the results 
previously obtained and show how they work on concrete, simple examples. More sophisti- 
cated examples, and some applications to real world data, will be developed in subsequent 
chapters. 


15.1 Summary 


We introduced the concept of freeness which can be summarized by the following intuitive 
statement: two large matrices are free if their eigenbases are related by a random rotation. 
In particular a large matrix drawn from a rotationally invariant ensemble is free with respect
to any matrix independent of it, for example a deterministic matrix.¹ For instance, $\mathbf{A}$ and
$\mathbf{O}\mathbf{B}\mathbf{O}^T$ are free when $\mathbf{O}$ is a random rotation matrix (in the large dimension limit). When $\mathbf{A}$
and $\mathbf{B}$ are free, their R- and S-transforms are, respectively, additive and multiplicative:

$$R_{\mathbf{A}+\mathbf{B}}(x) = R_{\mathbf{A}}(x) + R_{\mathbf{B}}(x), \qquad S_{\mathbf{A}\mathbf{B}}(t) = S_{\mathbf{A}}(t)\,S_{\mathbf{B}}(t). \tag{15.1}$$

The free product needs some clarification as $\mathbf{A}\mathbf{B}$ is in general not a symmetric matrix;
the S-transform $S_{\mathbf{A}\mathbf{B}}(t)$ in fact relates to the eigenvalues of the matrix $\sqrt{\mathbf{A}}\,\mathbf{B}\sqrt{\mathbf{A}}$, which are
the same as those of $\sqrt{\mathbf{B}}\,\mathbf{A}\sqrt{\mathbf{B}}$ when both $\mathbf{A}$ and $\mathbf{B}$ are positive semi-definite (otherwise the
square-root is ill defined).


15.1.1 R- and S-Transforms 


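The R- and S-transforms are defined by the following relations:

$$g_{\mathbf{A}}(z) = \tau\left[(z\mathbf{1} - \mathbf{A})^{-1}\right], \tag{15.2}$$

$$t_{\mathbf{A}}(\zeta) = \tau\left[(\mathbf{1} - \zeta^{-1}\mathbf{A})^{-1}\right] - 1 = \zeta g_{\mathbf{A}}(\zeta) - 1, \tag{15.3}$$

$$R_{\mathbf{A}}(x) = \mathfrak{z}_{\mathbf{A}}(x) - \frac{1}{x}, \qquad S_{\mathbf{A}}(t) = \frac{t + 1}{t\,\zeta_{\mathbf{A}}(t)} \quad \text{if } \tau(\mathbf{A}) \neq 0, \tag{15.4}$$

1 By large, we mean that all normalized moments computed using freeness are correct up to corrections that are $O(1/N)$.

where $\mathfrak{z}_{\mathbf{A}}(x)$ and $\zeta_{\mathbf{A}}(t)$ are the inverse functions of $g_{\mathbf{A}}(z)$ and $t_{\mathbf{A}}(\zeta)$, respectively.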
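Under multiplication by a scalar they behave as

$$R_{\alpha\mathbf{A}}(x) = \alpha R_{\mathbf{A}}(\alpha x), \qquad S_{\alpha\mathbf{A}}(t) = \alpha^{-1}S_{\mathbf{A}}(t). \tag{15.5}$$

While the R-transform behaves simply under a shift by a scalar,

$$R_{\mathbf{A}+\alpha\mathbf{1}}(x) = \alpha + R_{\mathbf{A}}(x), \tag{15.6}$$

there is no simple formula for computing the S-transform of a shifted matrix. On the other
hand, the S-transform is simple under matrix inversion:

$$S_{\mathbf{A}^{-1}}(t) = \frac{1}{S_{\mathbf{A}}(-t - 1)}. \tag{15.7}$$

The two transforms are related by the following equivalent identities:

$$S_{\mathbf{A}}(t) = \frac{1}{R_{\mathbf{A}}(tS_{\mathbf{A}}(t))}, \qquad R_{\mathbf{A}}(x) = \frac{1}{S_{\mathbf{A}}(xR_{\mathbf{A}}(x))}. \tag{15.8}$$

The identity matrix has particularly simple transforms:

$$g_{\mathbf{1}}(z) = \frac{1}{z - 1}, \qquad t_{\mathbf{1}}(\zeta) = \frac{1}{\zeta - 1}, \tag{15.9}$$

$$R_{\mathbf{1}}(x) = 1, \qquad S_{\mathbf{1}}(t) = 1. \tag{15.10}$$

The R- and S-transforms have the following Taylor expansion for small arguments:

$$R_{\mathbf{A}}(x) = \kappa_1 + \kappa_2 x + \kappa_3 x^2 + \cdots, \qquad S_{\mathbf{A}}(x) = \frac{1}{\kappa_1} - \frac{\kappa_2}{\kappa_1^3}\,x + \frac{2\kappa_2^2 - \kappa_1\kappa_3}{\kappa_1^5}\,x^2 + \cdots, \tag{15.11}$$

where $\kappa_n$ are the free cumulants of $\mathbf{A}$:

$$\kappa_1 = \tau(\mathbf{A}), \qquad \kappa_2 = \tau(\mathbf{A}^2) - \tau^2(\mathbf{A}), \qquad \kappa_3 = \tau(\mathbf{A}^3) - 3\tau(\mathbf{A})\tau(\mathbf{A}^2) + 2\tau^3(\mathbf{A}). \tag{15.12}$$

Combining Eqs. (15.7) and (15.11), we can obtain the inverse moments of $\mathbf{A}$ from its
S-transform. In particular,

$$\tau(\mathbf{A}^{-1}) = S_{\mathbf{A}}(-1), \qquad \tau(\mathbf{A}^{-2}) = S_{\mathbf{A}}(-1)\left(S_{\mathbf{A}}(-1) - S'_{\mathbf{A}}(-1)\right). \tag{15.13}$$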


15.1.2 Computing the Eigenvalue Density 


The R-transform provides a systematic way to obtain the spectrum of the sum C of two 
independent matrices A and B, where at least one of them is rotationally invariant. Here is 
a simple recipe to compute the eigenvalue density of a free sum of matrices: 


1 Find $g_{\mathbf{B}}(z)$ and $g_{\mathbf{A}}(z)$.
2 Invert $g_{\mathbf{B}}(z)$ and $g_{\mathbf{A}}(z)$ to get $\mathfrak{z}_{\mathbf{B}}(g)$ and $\mathfrak{z}_{\mathbf{A}}(g)$, and hence $R_{\mathbf{B}}(x)$ and $R_{\mathbf{A}}(x)$.
3 $R_{\mathbf{C}}(x) = R_{\mathbf{B}}(x) + R_{\mathbf{A}}(x)$, which gives $\mathfrak{z}_{\mathbf{C}}(g) = R_{\mathbf{C}}(g) + g^{-1}$.
4 Solve $\mathfrak{z}_{\mathbf{C}}(g) = z$ for $g_{\mathbf{C}}(z)$.
5 Use Eq. (2.47) to find the density:

$$\rho_{\mathbf{C}}(\lambda) = \frac{\lim_{\eta \to 0^+} \operatorname{Im} g_{\mathbf{C}}(\lambda - i\eta)}{\pi}. \tag{15.14}$$

In the multiplicative case ($\mathbf{C} = \mathbf{A}^{\frac12}\mathbf{B}\mathbf{A}^{\frac12}$), the recipe is similar:

1 Find $t_{\mathbf{B}}(\zeta)$ and $t_{\mathbf{A}}(\zeta)$.
2 Invert $t_{\mathbf{B}}(\zeta)$ and $t_{\mathbf{A}}(\zeta)$ to get $\zeta_{\mathbf{B}}(t)$ and $\zeta_{\mathbf{A}}(t)$, and hence $S_{\mathbf{B}}(t)$ and $S_{\mathbf{A}}(t)$.
3 $S_{\mathbf{C}}(t) = S_{\mathbf{B}}(t)S_{\mathbf{A}}(t)$, which gives $\zeta_{\mathbf{C}}(t)\,S_{\mathbf{C}}(t)\,t = t + 1$.
4 Solve $\zeta_{\mathbf{C}}(t) = \zeta$ for $t_{\mathbf{C}}(\zeta)$.
5 Equation (2.47) for $g_{\mathbf{C}}(z) = (t_{\mathbf{C}}(z) + 1)/z$ is equivalent to

$$\rho_{\mathbf{C}}(\lambda) = \frac{\lim_{\eta \to 0^+} \operatorname{Im} t_{\mathbf{C}}(\lambda - i\eta)}{\pi\lambda}. \tag{15.15}$$

In some cases, the equation in step 4 is exactly solvable. But it is usually a high order
polynomial equation, or worse, a transcendental equation. In these cases numerical solution
is still possible. There always exists at least one solution that satisfies

$$g_{\mathbf{C}}(z) = z^{-1} + O(z^{-2}) \tag{15.16}$$

for $z \to \infty$. Since the eigenvalues of $\mathbf{B}$ and $\mathbf{A}$ are real, their R- and S-transforms are real
for real arguments. Hence the equation in step 4 is an equation with real coefficients. In
order to find a non-zero eigenvalue density we need to find solutions with a strictly positive
imaginary part when the parameter $\eta$ goes to zero. When the equation is quadratic or cubic,
complex solutions come in complex conjugated pairs: therefore, at most one solution will
have a strictly positive imaginary part. As a numerical trick, $\pi\rho(\lambda)$ can be equated with the
maximum of the imaginary part of all two or three solutions (the density will be zero when
all solutions are real). For higher order polynomial and transcendental equations, we have
to be more careful as there can be spurious complex solutions with positive imaginary part.
Exercise 15.2.1 shows how to do these computations in concrete cases.
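The additive recipe and the numerical trick above can be sketched on one concrete case. The example below is an illustration chosen here, not taken from the text: the free sum of a Wigner matrix of variance $\sigma^2$ and a white Wishart matrix of parameter $q$, for which $R_{\mathbf{C}}(g) = \sigma^2 g + 1/(1 - qg)$. Multiplying $\mathfrak{z}_{\mathbf{C}}(g) = z$ by $g(1 - qg)$ gives the cubic $-\sigma^2 q\,g^3 + (\sigma^2 + qz)\,g^2 + (1 - q - z)\,g + 1 = 0$. The code assumes NumPy, and `free_sum_density` is an illustrative name.

```python
import numpy as np

def free_sum_density(lam, sigma=1.0, q=0.5):
    """Density at real lam of the free sum Wigner(sigma^2) + white Wishart(q),
    using pi * rho = largest imaginary part among the roots of the cubic."""
    coeffs = [-sigma**2 * q,        # g^3
              sigma**2 + q * lam,   # g^2
              1.0 - q - lam,        # g^1
              1.0]                  # g^0
    roots = np.roots(coeffs)
    return max(float(roots.imag.max()), 0.0) / np.pi

if __name__ == "__main__":
    grid = np.linspace(-6.0, 12.0, 3601)
    dx = grid[1] - grid[0]
    rho = np.array([free_sum_density(x) for x in grid])
    print("normalization:", rho.sum() * dx)      # should be close to 1
    print("mean:", (grid * rho).sum() * dx)      # should be close to tau(X) + tau(W) = 1
```

Checking the normalization and the first moment is a cheap way to detect a wrong branch; for higher-order equations the same check can reveal spurious complex roots.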


15.2 R- and S-Transforms and Moments of Useful Ensembles 
15.2.1 Wigner Ensemble 


The Wigner ensemble is rotationally invariant, therefore a Wigner matrix is free from any
matrix from which it is independent. For a Wigner matrix $\mathbf{X}$ of variance $\sigma^2$, the R-transform
reads

$$R_{\mathbf{X}}(x) = \sigma^2 x. \tag{15.17}$$

The Wigner matrix is stable under free addition, i.e. the free sum of two Wigner matrices
of variance $\sigma_1^2$ and $\sigma_2^2$ is a Wigner with variance $\sigma_1^2 + \sigma_2^2$.

The Wigner matrix is traceless ($\tau(\mathbf{X}) = 0$), so its S-transform is ill-defined. However, we
can shift $\mathbf{X}$ by $m\mathbf{1}$, i.e. shift the mean of its diagonal entries by a certain parameter $m$.
We then have $R_{\mathbf{X}+m}(x) = m + \sigma^2 x$. We can use Eq. (11.97) and compute the S-transform:

$$S_{\mathbf{X}+m}(t) = \frac{\sqrt{m^2 + 4\sigma^2 t} - m}{2\sigma^2 t} = \frac{m}{2\sigma^2 t}\left(\sqrt{1 + \frac{4\sigma^2 t}{m^2}} - 1\right). \tag{15.18}$$

It is regular at $t = 0$ whenever $m > 0$ and tends to $(\sigma\sqrt{t})^{-1}$ when $m \to 0$.

Finally, let us recall the formula for the positive moments of Wigner matrices:

$$\tau(\mathbf{X}^{2k}) = \frac{(2k)!\,\sigma^{2k}}{(k + 1)\,k!^2}, \qquad \tau(\mathbf{X}^{2k+1}) = 0. \tag{15.19}$$

The negative moments of $\mathbf{X}$ are all infinite, because the density of zero eigenvalues is
positive.


15.2.2 Wishart Ensemble 


For a white Wishart matrix $\mathbf{W}_q$ with parameter $q = N/T$, one has (see Section 10.1)

$$R_{\mathbf{W}_q}(x) = \frac{1}{1 - qx}. \tag{15.20}$$

To compute its S-transform we first remember that its Stieltjes transform $g(z)$ satisfies Eq.
(4.37), which can be written as an equation for $t(\zeta)$ or its inverse $\zeta(t)$:

$$\zeta t - (1 + qt)(t + 1) = 0 \;\Rightarrow\; S_{\mathbf{W}_q}(t) = \frac{1}{1 + qt}. \tag{15.21}$$

The first few moments of a white Wishart matrix are given by

$$\tau(\mathbf{W}_q) = 1, \quad \tau(\mathbf{W}_q^2) = 1 + q, \quad \tau(\mathbf{W}_q^{-1}) = \frac{1}{1 - q}, \quad \tau(\mathbf{W}_q^{-2}) = \frac{1}{(1 - q)^3}. \tag{15.22}$$


15.2.3 Inverse-Wishart Ensemble 


We take the opportunity of this summary of R- and S-transforms to introduce a very 
useful ensemble of matrices, namely the inverse-Wishart ensemble. We will call an
inverse-Wishart matrix² the inverse of a white Wishart matrix, which, we recall, has unit normalized
trace.

For a Wishart matrix to be invertible we need to have $q < 1$. Let $\mathbf{W}_q$ be such a matrix.
Using Eq. (11.116) we can show that

$$S_{\mathbf{W}_q^{-1}}(t) = 1 - q - qt. \tag{15.23}$$

2 More generally the inverse of a Wishart matrix with any covariance $\mathbf{C}$ can be called an inverse-Wishart but we will only
consider white inverse-Wishart matrices.

Since $\tau(\mathbf{W}_q^{-1}) = 1/(1 - q)$, we define the (normalized) inverse-Wishart as
$\mathbf{M}_p = (1 - q)\mathbf{W}_q^{-1}$ and call $p := q/(1 - q)$. Rescaling and changing variable we obtain

$$S_{\mathbf{M}_p}(t) = 1 - pt. \tag{15.24}$$

By construction the inverse-Wishart has mean 1 and variance $p$. Using Eq. (15.11), we find
that it has $\kappa_3(\mathbf{M}_p) = 2p^2$, which is higher than the skewness of a white Wishart matrix
with the same variance ($\kappa_3(\mathbf{W}_q) = q^2$).

From the S-transform we can find the R-transform using Eq. (15.8):

$$R_{M_p}(x) = \frac{1 - \sqrt{1 - 4px}}{2px}. \tag{15.25}$$

To find the Stieltjes transform and the density, it is easier to compute the T-transform from Eq. (11.63) and convert the result into a Stieltjes transform:

$$g_{M_p}(z) = \frac{(1 + 2p)z - 1 - \sqrt[\pm]{(z - 1)^2 - 4pz}}{2pz^2}. \tag{15.26}$$

We can use Eq. (2.47) to find the density of eigenvalues or do the following change of variable in the white Wishart density (Eq. (4.43)):

$$x = \frac{1 - q}{\lambda} \quad\text{and}\quad p = \frac{q}{1 - q}. \tag{15.27}$$

Both methods give (see Fig. 15.1)

$$\rho_{M_p}(x) = \frac{\sqrt{(x_+ - x)(x - x_-)}}{2\pi p x^2}, \qquad x_- < x < x_+, \tag{15.28}$$

with the edges of the spectrum given by

$$x_\pm = 2p + 1 \pm 2\sqrt{p(p + 1)}. \tag{15.29}$$

From the Stieltjes transform we can obtain

$$\tau(M_p^{-1}) = -\lim_{z \to 0} g_{M_p}(z) = 1 + p. \tag{15.30}$$

Other low moments of the inverse-Wishart matrix read

$$\tau(M_p) = 1, \quad \tau(M_p^2) = 1 + p, \quad \tau(M_p^{-1}) = 1 + p, \quad \tau(M_p^{-2}) = (1 + p)(1 + 2p). \tag{15.31}$$
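The density (15.28) and the moments (15.31) can be cross-checked by direct quadrature. The sketch below is only an illustration: the function names are hypothetical, and the integration uses the substitution $x = c + r\cos\theta$ (an implementation choice made here, not taken from the text) to absorb the square-root behaviour at the edges.

```python
import numpy as np

def inverse_wishart_density(x, p):
    """Density of Eq. (15.28) with edges from Eq. (15.29); zero outside the support."""
    x = np.asarray(x, dtype=float)
    root = 2.0 * np.sqrt(p * (p + 1.0))
    x_minus, x_plus = 2.0 * p + 1.0 - root, 2.0 * p + 1.0 + root
    inside = np.clip((x_plus - x) * (x - x_minus), 0.0, None)
    return np.sqrt(inside) / (2.0 * np.pi * p * x ** 2)

def moment(k, p, n=20000):
    """tau(M_p^k) by midpoint quadrature in the angle theta, with x = c + r cos(theta)."""
    c, r = 2.0 * p + 1.0, 2.0 * np.sqrt(p * (p + 1.0))
    theta = (np.arange(n) + 0.5) * np.pi / n
    x = c + r * np.cos(theta)
    weights = r * np.sin(theta) * np.pi / n   # |dx| = r sin(theta) dtheta
    return float(np.sum(x ** k * inverse_wishart_density(x, p) * weights))

# For p = 1/2 the moments k = 0, 1, 2, -1, -2 should be 1, 1, 1.5, 1.5, 3.
print([round(moment(k, 0.5), 6) for k in (0, 1, 2, -1, -2)])
```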


Finally, the large $N$ inverse-Wishart matrix potential can be obtained from the real part of the Stieltjes transform using (5.38):

$$V_{M_p}(x) = \frac{1}{px} + \frac{1 + 2p}{p}\log x. \tag{15.32}$$

Figure 15.1 Density of eigenvalues for an inverse-Wishart distribution with $p = 1/2$. The white Wishart distribution ($q = 1/2$) is shown for comparison. Both laws are normalized, have unit mean and variance 1/2.


For completeness we give here the law of the elements of a general (not necessarily white) inverse-Wishart matrix at finite $N$. We recall the law of a general Wishart matrix Eq. (4.16):

$$P(E) = \frac{T^{NT/2}}{2^{NT/2}\,\Gamma_N(T/2)}\,\frac{(\det E)^{(T-N-1)/2}}{(\det C)^{T/2}}\,\exp\left[-\frac{T}{2}\operatorname{Tr}\left(E C^{-1}\right)\right], \tag{15.33}$$

where $E$ is an $N \times N$ Wishart matrix measured over $T$ time steps with true correlations $C$ and normalized such that $\mathbb{E}[E] = C$. We define the inverse-Wishart as $M = E^{-1}$. Note that a finite $N$ Wishart matrix has

$$\mathbb{E}\left[E^{-1}\right] = \frac{T}{T - N - 1}\,C^{-1} := \Sigma, \tag{15.34}$$

where we have defined the matrix $\Sigma$ such that $\mathbb{E}[M] = \Sigma$. To do the change of variable $E \to M$ in the joint probability density, we need to multiply by the Jacobian $(\det M)^{-N-1}$ (see Eq. (1.41)). Putting everything together we obtain

$$P(M) = \frac{(T - N - 1)^{NT/2}}{2^{NT/2}\,\Gamma_N(T/2)}\,\frac{(\det \Sigma)^{T/2}}{(\det M)^{(T+N+1)/2}}\,\exp\left[-\frac{T - N - 1}{2}\operatorname{Tr}\left(M^{-1}\Sigma\right)\right]. \tag{15.35}$$

In the scalar case $N = 1$, the inverse-Wishart distribution reduces to an inverse-gamma distribution:

$$P(m) = \frac{b^a}{\Gamma(a)}\,\frac{e^{-b/m}}{m^{a+1}}, \tag{15.36}$$

with $b = (T - 2)\Sigma/2$ and $a = T/2$.
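The finite-$N$ mean (15.34) lends itself to a quick Monte Carlo check. This is a sketch under the assumption of Gaussian data; the function name `mean_inverse_wishart` and the particular matrix `c` are illustrative choices, not taken from the text.

```python
import numpy as np

def mean_inverse_wishart(c, t, n_samples, rng):
    """Monte Carlo estimate of E[E^{-1}] for E = X X^T / T, columns of X drawn from N(0, C)."""
    n = c.shape[0]
    chol = np.linalg.cholesky(c)
    x = chol @ rng.standard_normal((n_samples, n, t))   # batch of N x T data matrices
    e = x @ np.swapaxes(x, 1, 2) / t                    # batch of Wishart matrices
    return np.linalg.inv(e).mean(axis=0)

rng = np.random.default_rng(2)
c = np.array([[1.0, 0.3, 0.0],
              [0.3, 2.0, 0.5],
              [0.0, 0.5, 1.5]])
n, t = 3, 12
estimate = mean_inverse_wishart(c, t, 20000, rng)
sigma_matrix = t / (t - n - 1) * np.linalg.inv(c)       # right-hand side of Eq. (15.34)
print(np.abs(estimate - sigma_matrix).max())
```

The agreement is statistical: the residual should shrink roughly as the inverse square root of the number of samples.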
Exercise 15.2.1 Free product of two Wishart matrices
In this exercise, we will compute the eigenvalue distribution of a matrix $E = (W_{q_0})^{1/2}\,W_q\,(W_{q_0})^{1/2}$; as we will see in Section 17.1, this matrix would be the sample covariance matrix of data with true covariance given by a Wishart with $q_0$ observed over $T$ samples such that $q = N/T$.

(a) Using Eq. (15.21) and the multiplicativity of the S-transform, write the S-transform of $E$.

(b) Using the definition of the S-transform write an equation for $t_E(\zeta)$. It is a cubic equation in $t$. If either $q_0$ or $q$ goes to zero, it reduces to the standard Marčenko-Pastur quadratic equation.

(c) Use Eq. (15.15) and a numerical root finder to plot the eigenvalue density of $E$ for $q_0 = 1/2$ and $q \in \{1/4, 1/2, 3/4\}$. In practice you can work with $\eta = 0$; of the three roots of your cubic equation, at most one will have a positive imaginary part. When all three solutions are real $\rho_E(\lambda) = 0$.

(d) Generate numerically two independent Wishart matrices with $q = 1/2$ ($N = 1000$ and $T = 2000$) and compute $E = (W_{q_0})^{1/2}\,W_q\,(W_{q_0})^{1/2}$. Note that the square-root of a matrix is obtained by applying the square-roots to its eigenvalues. Diagonalize your $E$ and compare its density with your result in (c).
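One possible way to set up the root-finding step of part (c) is sketched below. It assumes the cubic $\zeta t = (t + 1)(1 + q_0 t)(1 + qt)$, obtained here from (15.21) and the multiplicativity of the S-transform, and it assumes the density is recovered as $\rho(\lambda) = \operatorname{Im} t/(\pi\lambda)$; the function name `product_density` is hypothetical.

```python
import numpy as np

def product_density(lam, q0, q):
    """Density of E at lam from the root of the cubic with positive imaginary part."""
    # Expanded form of (t + 1)(1 + q0 t)(1 + q t) - lam * t = 0, highest power first.
    coeffs = [q0 * q, q0 * q + q0 + q, 1.0 + q0 + q - lam, 1.0]
    roots = np.roots(coeffs)
    # Complex roots come in conjugate pairs, so at most one has Im > 0.
    return max(roots.imag.max(), 0.0) / (np.pi * lam)

# Coarse scan of the density for q0 = 1/2 and q = 1/4.
grid = np.linspace(0.05, 8.0, 160)
rho = np.array([product_density(x, 0.5, 0.25) for x in grid])
print(grid[rho > 0].min(), grid[rho > 0].max())   # rough location of the support
```

Setting `q0 = 0` reduces the cubic to the Marčenko-Pastur quadratic, which gives a simple sanity check of the sign conventions.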


15.3 Worked-Out Examples: Addition 
15.3.1 The Arcsine Law 


Consider the free sum of two symmetric orthogonal matrices, i.e. matrices with eigenvalues $\pm 1$ with equal weights. Let $M_1$ and $M_2$ be two such matrices; their Stieltjes and R-transforms are given by

$$g(z) = \frac{z}{z^2 - 1} \quad\text{and}\quad R(g) = \frac{\sqrt{1 + 4g^2} - 1}{2g}, \tag{15.37}$$

from which we can deduce that $M = \frac{1}{2}(M_1 + M_2)$ has an R-transform given by

$$R_M(g) = \frac{\sqrt{1 + g^2} - 1}{g}, \tag{15.38}$$

where we have used the scaling $R_{\alpha A}(x) = \alpha R_A(\alpha x)$ with $\alpha = 1/2$. The corresponding Stieltjes transform reads

$$g_M(z) = \frac{1}{z\sqrt{1 - 1/z^2}}. \tag{15.39}$$

From this expression we deduce that the density of eigenvalues is given by the centered arcsine law:

$$\rho_M(\lambda) = \frac{1}{\pi}\frac{1}{\sqrt{1 - \lambda^2}}, \qquad \lambda \in (-1, 1), \tag{15.40}$$

and zero elsewhere. This corresponds to a special case of the Jacobi ensemble that we have encountered in Section 7.1.3.
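A short simulation illustrates this result. It is a sketch, assuming that a Haar-distributed rotation is an adequate stand-in for freeness at finite $N$; the helper names are hypothetical. For the arcsine law on $(-1, 1)$ the second and fourth moments are $1/2$ and $3/8$.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))   # fix the column signs so that the law is Haar

def free_sum_of_signs(n, rng):
    """Eigenvalues of M = (M1 + M2)/2 with M1 = diag(+1, -1, ...) and M2 = O M1 O^T."""
    m1 = np.diag(np.where(np.arange(n) % 2 == 0, 1.0, -1.0))
    o = haar_orthogonal(n, rng)
    return np.linalg.eigvalsh(0.5 * (m1 + o @ m1 @ o.T))

rng = np.random.default_rng(3)
eigs = free_sum_of_signs(400, rng)
print(np.mean(eigs ** 2), np.mean(eigs ** 4))   # compare with 1/2 and 3/8
```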


15.3.2 Sum of Uniform Densities 


Suppose now we want to compute the eigenvalue distribution of a matrix $M = U + OUO^T$, where $U$ is a diagonal matrix with entries uniformly distributed between $-1$ and $1$ (e.g. $[U]_{kk} = 1 + (1 - 2k)/N$) and $O$ a random orthogonal matrix. This is the free sum of two matrices with uniform eigenvalue density.

First we need to compute the Stieltjes transform of $U$. We have

$$\rho_U(\lambda) = \frac{1}{2}, \qquad \lambda \in (-1, 1). \tag{15.41}$$

The corresponding Stieltjes transform is³

$$g_U(z) = \frac{1}{2}\int_{-1}^{1}\frac{d\lambda}{z - \lambda} = \frac{1}{2}\log\left(\frac{z + 1}{z - 1}\right). \tag{15.42}$$

Note that when $-1 < \lambda < 1$ the argument of the log in $g_U(z)$ is negative so $\operatorname{Im} g_U(\lambda - i\eta) = \pi/2$, consistent with a uniform distribution of eigenvalues. We then compute the R-transform by finding the inverse of $g_U(z)$:

$$\mathfrak{z}(g) = \frac{e^{2g} + 1}{e^{2g} - 1} = \coth(g). \tag{15.43}$$

And so the R-transform of $U$ is given by

$$R_U(g) = \coth(g) - \frac{1}{g}. \tag{15.44}$$

The R-transform of $M$ is twice that of $U$. To find the Stieltjes transform of $M$ we thus need to solve

$$z = R_M(g) + \frac{1}{g} = 2\coth(g) - \frac{1}{g} \tag{15.45}$$

for $g(z)$. This is a transcendental equation and we need to solve it for complex $z$ near the real axis. Before attempting to do this, it is useful to plot $\mathfrak{z}(g)$ (Fig. 15.2). The region where $z = \mathfrak{z}(g)$ does not have real solutions is where the eigenvalues are. This region is between a local maximum and a local minimum of $\mathfrak{z}(g)$. We should look for complex solutions of Eq. (15.45) near the real axis for $\operatorname{Re}(z)$ between $-1.54$ and $1.54$. We can then put this equation into a complex non-linear solver. The density will be given by $\operatorname{Im} g(z)/\pi$ for $\operatorname{Im}(z)$


³ A more general uniform density between $[m - a, m + a]$ has mean $m$, variance $a^2/3$ and $g_U(z) = \log((z - m + a)/(z - m - a))/(2a)$.
Figure 15.2 The function $\mathfrak{z}(g) = R_M(g) + 1/g$ for the free sum of two flat distributions. Note that there is a region of $z$ near $[-1.5, 1.5]$ where $z = \mathfrak{z}(g)$ does not have real solutions. This is where the eigenvalues lie. The inset shows a zoom of the region near $z = 1.5$, indicating more clearly that $\mathfrak{z}(g)$ has a minimum at $g_+ \approx 1.49$, so $\lambda_+ = \mathfrak{z}(g_+)$. The exact edges of the spectrum are $\lambda_\pm \approx \pm 1.5429$.


Figure 15.3 Density of eigenvalues for the free sum of two uniform distributions. The continuous curve was computed using a numerical solution of Eq. (15.45). The histogram is a numerical simulation with $N = 5000$.


very small and $\operatorname{Re}(z)$ in the desired range. Note that complex solutions come in conjugate pairs, and it is hard to force the solver to find the correct one. This is not a problem; since their imaginary parts have the same absolute value, we can just use

$$\rho(\lambda) = \frac{|\operatorname{Im} g(\lambda - i\eta)|}{\pi} \quad\text{for some small } \eta. \tag{15.46}$$

We have plotted the resulting density in Figure 15.3.
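The numerical procedure described above can be sketched as follows. This is only one possible implementation: it uses a plain Newton iteration with continuation in $\lambda$ (a choice made here, not prescribed by the text), starts at $\lambda = 0$ from the purely imaginary root $g = iy$ with $\tan y = 2y$, and locates the edge from the condition $\sinh g = \sqrt{2}\,g$ for the extremum of $\mathfrak{z}(g)$. All function names are hypothetical.

```python
import cmath
import math

def edge_of_spectrum():
    """Upper edge: minimum of z(g) = 2 coth(g) - 1/g on g > 0, i.e. sinh(g) = sqrt(2) g."""
    lo, hi = 1.0, 2.0                      # sinh(g) - sqrt(2) g changes sign on [1, 2]
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if math.sinh(mid) - math.sqrt(2.0) * mid < 0.0:
            lo = mid
        else:
            hi = mid
    g_plus = 0.5 * (lo + hi)
    return g_plus, 2.0 / math.tanh(g_plus) - 1.0 / g_plus

def solve_g(z, g0, n_iter=100, tol=1e-13):
    """Newton iteration for 2 coth(g) - 1/g = z, started from the guess g0."""
    g = g0
    for _ in range(n_iter):
        f = 2.0 / cmath.tanh(g) - 1.0 / g - z
        df = -2.0 / cmath.sinh(g) ** 2 + 1.0 / g ** 2
        step = f / df
        g -= step
        if abs(step) < tol:
            break
    return g

def density_on_grid(n_points=600, eta=1e-4):
    """Density of Eq. (15.46) on midpoints of [0, lambda_+], following one root continuously."""
    _, lam_plus = edge_of_spectrum()
    h = lam_plus / n_points
    g = 1.1656j                            # root at lambda = 0: g = i y with tan(y) = 2 y
    lams, rho = [], []
    for k in range(n_points):
        lam = (k + 0.5) * h
        g = solve_g(complex(lam, -eta), g)
        lams.append(lam)
        rho.append(abs(g.imag) / math.pi)
    return lams, rho

print(edge_of_spectrum())
```

By symmetry the density on the negative side is the mirror image, so twice the integral over $[0, \lambda_+]$ should be close to one.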
15.4 Worked-Out Examples: Multiplication
15.4.1 Free Product of a Wishart and an Inverse-Wishart


Consider the free product of a Wishart $W_q$ of parameter $q$ and an independent inverse-Wishart $M_p$ of parameter $p$, i.e. $E = \sqrt{M_p}\,W_q\,\sqrt{M_p}$. We already have the building blocks:

$$S_{M_p}(t) = 1 - pt; \qquad S_{W_q}(t) = \frac{1}{1 + qt} \;\Rightarrow\; S_E(t) = \frac{1 - pt}{1 + qt}, \tag{15.47}$$

leading to

$$\zeta_E(t) = \frac{(t + 1)(1 + qt)}{t(1 - pt)}. \tag{15.48}$$

Inverting this relation to obtain $t_E(\zeta)$ leads to a quadratic equation for $t$:

$$(t + 1)(1 + qt) = \zeta t(1 - pt), \tag{15.49}$$

which can be explicitly solved as

$$t_E(\zeta) = \frac{\zeta - q - 1 - \sqrt[\pm]{(q + 1 - \zeta)^2 - 4(q + \zeta p)}}{2(q + \zeta p)}. \tag{15.50}$$

Using Eq. (15.15) finally yields

$$\rho_E(\lambda) = \frac{\sqrt{4(p\lambda + q) - (1 + q - \lambda)^2}}{2\pi\lambda(p\lambda + q)}. \tag{15.51}$$

The edges of the support are given by

$$\lambda_\pm = 1 + q + 2p \pm 2\sqrt{(1 + p)(q + p)}. \tag{15.52}$$

One can check that the limit $p \to 0$ recovers the trivial case $M_0 = \mathbf{1}$ for which the Marčenko-Pastur edges indeed read

$$\lambda_\pm = (1 + q) \pm 2\sqrt{q}. \tag{15.53}$$
Exercise 15.4.1 Free product of Wishart and inverse-Wishart
(a) Generate numerically a normalized inverse-Wishart $M_p$ for $p = 1/4$ and $N = 1000$. Check that $\tau(M_p) = 1$ and $\tau(M_p^2) = 1.25$. Plot a normalized histogram of the eigenvalues of $M_p$ and compare with Eq. (15.28).

(b) Generate an independent Wishart $W_q$ with $q = 1/4$ and compute $E = \sqrt{M_p}\,W_q\,\sqrt{M_p}$. To compute $\sqrt{M_p}$, diagonalize $M_p$, take the square-root of its eigenvalues and reconstruct $\sqrt{M_p}$. Check that $\tau(E) = 1$ and $\tau(E^2) = 1.5$. Plot a normalized histogram of the eigenvalues of $E$ and compare with Eq. (15.51).

(c) For every eigenvector $v_k$ of $E$ compute $\xi_k := v_k^T M_p v_k$, and make a scatter plot of $\xi_k$ vs $\lambda_k$, the eigenvalue of $v_k$. Your scatter plot should show a noisy straight line. We will see in Chapter 19, Eq. (19.49), that this is related to the fact that linear shrinkage is the optimal estimator of the true covariance from the sample covariance when the true covariance is an inverse-Wishart.
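The matrix constructions asked for in this exercise can be set up along the following lines. This is a sketch rather than a full solution: it uses a smaller size ($N = 300$) than the exercise to keep the run short, leaves the plots out, and all helper names are hypothetical.

```python
import numpy as np

def white_wishart(n, q, rng):
    """White Wishart W = H H^T / T with T = N/q and IID standard Gaussian entries in H."""
    t = int(round(n / q))
    h = rng.standard_normal((n, t))
    return h @ h.T / t

def normalized_inverse_wishart(n, p, rng):
    """M_p = (1 - q) W_q^{-1} with q = p/(1 + p), so that tau(M_p) is close to 1."""
    q = p / (1.0 + p)
    return (1.0 - q) * np.linalg.inv(white_wishart(n, q, rng))

def sym_sqrt(m):
    """Symmetric square root obtained from the eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.sqrt(vals)) @ vecs.T

def tau(m, k=1):
    """Normalized trace of M^k."""
    return float(np.mean(np.linalg.eigvalsh(m) ** k))

rng = np.random.default_rng(1)
n, p, q = 300, 0.25, 0.25
m_p = normalized_inverse_wishart(n, p, rng)
root = sym_sqrt(m_p)
e = root @ white_wishart(n, q, rng) @ root
e = 0.5 * (e + e.T)                                  # remove round-off asymmetry

lam, vecs = np.linalg.eigh(e)
xi = np.einsum("ik,ij,jk->k", vecs, m_p, vecs)       # xi_k = v_k^T M_p v_k for part (c)
print(tau(m_p), tau(m_p, 2), tau(e), tau(e, 2))
```

The pairs `(lam, xi)` are the points of the scatter plot in part (c); the printed moments should be close to 1, 1.25, 1 and 1.5 up to finite-size effects.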


15.4.2 Free Product of Projectors 


As our last simple example, consider a space of large dimension $N$, and a projector $P_1$ on a subspace of dimension $N_1 < N$, i.e. a diagonal matrix with $N_1$ diagonal elements equal to unity, and $N - N_1$ elements equal to zero. We now introduce a second projector $P_2$ on a subspace of dimension $N_2 < N$, and would like to study the eigenvalues of the free product of these two projectors, $P = P_1 P_2$. Clearly, all eigenvalues of $P$ must lie in the interval $[0, 1]$.

As now usual, we first need to compute the T-transform of $P_1$ and $P_2$. We define the ratios $q_a = N_a/N$, $a = 1, 2$. Since $P_a$ has $N_a$ eigenvalues equal to unity and $N - N_a$ eigenvalues equal to zero, one finds

$$g_{P_a}(z) = \frac{1}{N}\left[\frac{N_a}{z - 1} + \frac{N - N_a}{z}\right] = \frac{q_a - 1 + z}{z(z - 1)} \;\Rightarrow\; t_{P_a}(\zeta) = \frac{q_a}{\zeta - 1}. \tag{15.54}$$

Therefore, the inverses of the T-transforms just read $\zeta_{P_a}(t) = 1 + q_a/t$, and

$$S_{P_a}(t) = \frac{t + 1}{t + q_a} \;\Rightarrow\; S_P(t) = \frac{(t + 1)^2}{(t + q_1)(t + q_2)}. \tag{15.55}$$

Now, going backwards,

$$\zeta_P(t) = \frac{(t + q_1)(t + q_2)}{t(t + 1)}, \tag{15.56}$$

again leading to a quadratic equation for $t_P(\zeta)$:

$$(\zeta - 1)t^2 + (\zeta - q_1 - q_2)t - q_1 q_2 = 0, \tag{15.57}$$

whose solution is

$$t_P(\zeta) = \frac{q_1 + q_2 - \zeta + \sqrt[\pm]{\zeta^2 - 2\zeta(q_1 + q_2 - 2q_1 q_2) + (q_1 - q_2)^2}}{2(\zeta - 1)}, \tag{15.58}$$

where the notation $\sqrt[\pm]{\cdot}$ defined by Eq. (4.56) ensures that we pick the correct root. Note that the argument under the square-root has zeros for

$$\lambda_\pm = q_1 + q_2 - 2q_1 q_2 \pm 2\sqrt{q_1 q_2(1 - q_1)(1 - q_2)}. \tag{15.59}$$

One can check that $\lambda_- \geq 0$, the zero bound being reached for $q_1 = q_2$. Note also that $\lambda_+\lambda_- = (q_1 - q_2)^2$.

The Stieltjes transform of $P$ can thus be written as

$$g_P(z) = \frac{1}{z} + \frac{q_1 + q_2 - z + \sqrt[\pm]{z^2 - 2z(q_1 + q_2 - 2q_1 q_2) + (q_1 - q_2)^2}}{2z(z - 1)}. \tag{15.60}$$

This quantity has poles at $z = 0$ and $z = 1$, and an imaginary part when $z \in (\lambda_-, \lambda_+)$. The spectrum of $P$ therefore has a continuous part, given by

$$\rho_P(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\lambda(1 - \lambda)}, \tag{15.61}$$

and two delta peaks, $A_0\delta(\lambda)$ and $A_1\delta(\lambda - 1)$. To find the amplitude of the potential poles, we need to compute the numerator of (15.60) at the values $z = 0$ and $z = 1$. Remember that $\sqrt[\pm]{\cdot}$ equals $-\sqrt{\cdot}$ on the real axis left of the left edge and $+\sqrt{\cdot}$ right of the right edge. The amplitude of the $z = 0$ pole of $g_P(z)$ is

$$A_0 = 1 - \frac{(q_1 + q_2) - \sqrt{(q_1 - q_2)^2}}{2} = 1 - \min(q_1, q_2), \tag{15.62}$$

while the amplitude of the $z = 1$ pole is

$$A_1 = \frac{(q_1 + q_2 - 1) + \sqrt{1 - 2(q_1 + q_2 - 2q_1 q_2) + (q_1 - q_2)^2}}{2} = \frac{(q_1 + q_2 - 1) + \sqrt{(q_1 + q_2 - 1)^2}}{2} = \max(q_1 + q_2 - 1, 0). \tag{15.63}$$

This makes a lot of sense geometrically: our product of two projectors can only have a unit eigenvalue if the sum of the dimensions of space spanned by these two projectors exceeds the total dimension $N$, i.e. when $N_1 + N_2 > N$. Otherwise, there cannot be (generically) any eigenvalue beyond $\lambda_+$.

When $q_1 + q_2 < 1$, the density of non-zero eigenvalues (15.61) is the same (up to a normalization) as the density of eigenvalues of a Jacobi matrix (7.20). If we match the edges of the spectrum we find the identification $c_1 = q_{\max}/q_{\min}$ and $c_+ = 1/q_{\min}$. The ratio of normalization $1/c_+ = q_{\min}$ implies that the product of projectors density has a missing mass of $1 - q_{\min}$, which is precisely the Dirac mass at zero. The special case $q_1 = q_2 = 1/2$ was discussed in the context of $2 \times 2$ matrices in Exercise 12.5.2. In that case, half of the eigenvalues are zero and the other half are distributed according to the arcsine law: the arcsine law is the limit of a Jacobi matrix with $c_1 \to 1$ and $c_+ \to 2$.


There is an alternative, geometric interpretation of the above calculation that turns out to be useful in many different contexts, see Section 17.4 for an extended discussion. The eigenvectors of projector $P_1$ form a set of $N_1$ orthonormal vectors $x_\alpha$, $\alpha = 1, \dots, N_1$, from which an $N_1 \times N$ matrix $X$ of components $(x_\alpha)_i$ can be formed. Similarly, we define an $N_2 \times N$ matrix $Y$ of components $(y_\beta)_i$, $\beta = 1, \dots, N_2$. Now, one can write $P$ as

$$P = X^T X Y^T Y. \tag{15.64}$$

The non-zero eigenvalues of $P$ are the same as the non-zero eigenvalues of $M^T M$ (or those of $M M^T$), where $M$ is the $N_1 \times N_2$ matrix of overlaps:

$$M_{\alpha,\beta} = \sum_{i=1}^{N} (x_\alpha)_i (y_\beta)_i. \tag{15.65}$$

The eigenvalues of $P$ correspond to the square of the singular values $s$ of $M$. The geometrical interpretation of these singular values is as follows: the largest singular value corresponds to the maximum overlap between any normalized linear combination of the $x_\alpha$ on the one hand and of the $y_\beta$ on the other hand. These two linear combinations define two one-dimensional subspaces of the spaces spanned by $x_\alpha$ and $y_\beta$. Once these optimal directions are removed, one can again ask the same question for the remaining $N_1 - 1$ and $N_2 - 1$ dimensional subspaces, defining the second largest singular value of $M$, and so on.
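Both the delta-peak amplitudes (15.62)-(15.63) and the overlap-matrix picture can be checked on a small example. The sketch below assumes the symmetrized product $P_1 P_2 P_1$ (which has the same eigenvalues as $P_1 P_2$) and a random subspace built from the QR decomposition of a Gaussian matrix; the function name `projector_product` is hypothetical.

```python
import numpy as np

def projector_product(n, n1, n2, rng):
    """Return the eigenvalues of P1 P2 P1 and the N1 x N2 overlap matrix of Eq. (15.65)."""
    # Rows of X: the first n1 canonical basis vectors, so P1 = X^T X is diagonal.
    x = np.eye(n)[:n1]
    # Rows of Y: an orthonormal basis of a random n2-dimensional subspace.
    q_mat, _ = np.linalg.qr(rng.standard_normal((n, n2)))
    y = q_mat.T
    p1, p2 = x.T @ x, y.T @ y
    eigs = np.linalg.eigvalsh(p1 @ p2 @ p1)
    return eigs, x @ y.T

rng = np.random.default_rng(4)
n, n1, n2 = 200, 150, 120                      # q1 = 0.75, q2 = 0.6, so q1 + q2 > 1
eigs, overlaps = projector_product(n, n1, n2, rng)

frac_zero = np.mean(np.abs(eigs) < 1e-8)       # compare with A_0 = 1 - min(q1, q2) = 0.4
frac_one = np.mean(eigs > 1.0 - 1e-8)          # compare with A_1 = q1 + q2 - 1 = 0.35
sing = np.linalg.svd(overlaps, compute_uv=False)
print(frac_zero, frac_one, np.sort(sing ** 2)[:3])
```

The squared singular values of the overlap matrix should coincide with the $\min(N_1, N_2)$ largest eigenvalues of the product.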


15.4.3 The Jacobi Ensemble Revisited 


We saw in the previous section that the free product of two random projectors has a very simple S-transform and that its non-zero eigenvalues are given by those of a Jacobi matrix. We suspect that the Jacobi ensemble has itself a simple S-transform. Rather than computing its S-transform from its Stieltjes transform (7.18), let us just use the properties of the S-transform to compute it directly from the definition of a Jacobi matrix.

Recall from Chapter 7, an $N \times N$ Jacobi matrix $J$ is defined as $(\mathbf{1} + E)^{-1}$ where the matrix $E$ is the free product of the inverse of an unnormalized Wishart matrix $W_1$ with $T_1 = c_1 N$ and another unnormalized Wishart $W_2$ with $T_2 = c_2 N$.

The two Wishart matrices have S-transforms given by

$$S_{W_{1,2}}(t) = \frac{1}{T_{1,2}}\,\frac{1}{1 + c_{1,2}^{-1}t} = \frac{N^{-1}}{c_{1,2} + t}. \tag{15.66}$$

Using the relation for inverse matrices (15.7), we find

$$S_{W_1^{-1}}(t) = N(c_1 - t - 1). \tag{15.67}$$

The S-transform of $E$ is just the product

$$S_E(t) = S_{W_1^{-1}}(t)\,S_{W_2}(t) = \frac{c_1 - t - 1}{c_2 + t}. \tag{15.68}$$

The next step is to shift the matrix $E$ by $\mathbf{1}$. As mentioned earlier, there is no easy rule to compute the S-transform of a shifted matrix. So this will be the hardest part of the computation.

The R-transform behaves simply under shift. The trick is to use one of Eqs. (15.8) to write an equation for $R_E(x)$, shift $E$ by $\mathbf{1}$ and use the other R-S relation to find back an equation for $S_{E+\mathbf{1}}(t)$. First we write

$$(c_2 + t)S_E = c_1 - t - 1. \tag{15.69}$$

The second of Eqs. (15.8) can be interpreted as the replacements $S \to 1/R$ and $t \to xR$ and gives

$$(1 + x - c_1 + xR_E)R_E + c_2 = 0. \tag{15.70}$$

Now $R_E(x) = R_{E+\mathbf{1}}(x) - 1$, so

$$(1 - c_1 + xR_{E+\mathbf{1}})(R_{E+\mathbf{1}} - 1) + c_2 = 0. \tag{15.71}$$



Following the first of Eqs. (15.8), we make the replacements $R \to 1/S$ and $x \to tS$ and find

$$1 - c_1 + t + (c_2 + c_1 - t - 1)S = 0 \;\Rightarrow\; S_{E+\mathbf{1}}(t) = \frac{t + 1 - c_1}{t + 1 - c_1 - c_2}. \tag{15.72}$$

Finally, using the relation for inverse matrices, Eq. (15.7) gives

$$S_J(t) = \frac{t + c_1 + c_2}{t + c_1}. \tag{15.73}$$

We can verify that the T-transform of the Jacobi ensemble

$$t_J(\zeta) = \frac{c_1 + 1 - c_+\zeta + \sqrt[\pm]{c_+^2\zeta^2 - 2(c_1 c_+ + c_+ - 2c_1)\zeta + (c_1 - 1)^2}}{2(\zeta - 1)}, \tag{15.74}$$

where $c_+ = c_1 + c_2$, is compatible with our previous result on the Stieltjes transform, Eq. (7.18). We can use the Taylor series of the S-transform (15.11) to find the first few cumulants:

$$\kappa_1 = \frac{c_1}{c_1 + c_2}, \qquad \kappa_2 = \frac{c_1 c_2}{(c_1 + c_2)^3}, \qquad \kappa_3 = \frac{c_1 c_2(c_2 - c_1)}{(c_1 + c_2)^5}. \tag{15.75}$$

From the S-transform we can compute the R-transform using Eq. (15.8):

$$R_J(x) = \frac{x - c_+ + \sqrt{x^2 + 2(c_1 - c_2)x + c_+^2}}{2x}. \tag{15.76}$$

Finally, we note that the arcsine law is a Jacobi matrix with $c_1 = c_2 = 1$ and has the following transforms:

$$S(t) = \frac{t + 2}{t + 1}, \qquad R(x) = \frac{x - 2 + \sqrt{x^2 + 4}}{2x}. \tag{15.77}$$

For the centered arcsine law we have $R_s(x) = 2R(2x) - 1$ and we recover Eq. (15.38).
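The Jacobi S- and R-transforms can be checked against each other with a few lines of arithmetic. The sketch below assumes the relation $S(t)\,R(tS(t)) = 1$ and the expansion $R(x) = \kappa_1 + \kappa_2 x + \kappa_3 x^2 + \dots$; the cumulants are estimated by finite differences, an implementation choice made here, and the function names are hypothetical.

```python
import math

def s_jacobi(t, c1, c2):
    """S-transform of the Jacobi ensemble: (t + c1 + c2) / (t + c1)."""
    return (t + c1 + c2) / (t + c1)

def r_jacobi(x, c1, c2):
    """R-transform of the Jacobi ensemble; x = 0 is a removable singularity."""
    c_plus = c1 + c2
    return (x - c_plus + math.sqrt(x * x + 2.0 * (c1 - c2) * x + c_plus ** 2)) / (2.0 * x)

def cumulants_from_r(c1, c2, h=1e-2):
    """First three free cumulants from R(x) = k1 + k2 x + k3 x^2 + ... by finite differences."""
    k1 = c1 / (c1 + c2)                       # R(0), the limit of r_jacobi as x -> 0
    r_plus, r_minus = r_jacobi(h, c1, c2), r_jacobi(-h, c1, c2)
    k2 = (r_plus - r_minus) / (2.0 * h)
    k3 = (0.5 * (r_plus + r_minus) - k1) / h ** 2
    return k1, k2, k3

s = s_jacobi(0.5, 2.0, 3.0)
print(s * r_jacobi(0.5 * s, 2.0, 3.0))        # should be 1 up to rounding
print(cumulants_from_r(2.0, 3.0))             # compare with c1 c2/(c1+c2)^3 and c1 c2 (c2-c1)/(c1+c2)^5
```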


16 


Products of Many Random Matrices 


In this chapter we consider an issue of importance in many different fields: that of products of many random matrices. This problem arises, for example, when one considers the
transmission of light in a succession of slabs of different optical indices, or the propagation 
of an electron in a disordered wire, or the way displacements propagate in granular media. 
It also appears in the context of chaotic systems when one wants to understand how a 
small difference in initial conditions “propagates” as the dynamics unfolds. In this context, 
one usually linearizes the dynamics in the vicinity of the unperturbed trajectory. If one 
takes stroboscopic snapshots of the system, the perturbation is obtained as the product 
of matrices (corresponding to the linearized dynamics) applied on the initial perturbation 
(see Chapter 1). If the phase space of the system is large enough, and the dynamics chaotic 
enough, one may expect that approximating the problem as a product of large, free matrices 
should be a good starting point. 


16.1 Products of Many Free Matrices 


The specific problem we will study is therefore the following: consider the symmetrized product of $K$ matrices, defined as

$$M_K = A_K A_{K-1} \cdots A_2 A_1 A_1^T A_2^T \cdots A_{K-1}^T A_K^T, \tag{16.1}$$

where all $A_i$ are identically distributed and mutually free, i.e. randomly rotated with respect to one another. We know now that in such a case the S-transforms simply multiply. Noting as $S_i(z)$ the S-transform of $A_i A_i^T$, and $S_{M_K}(z)$ the S-transform of $M_K$, one has

$$S_{M_K}(z) = \prod_{i=1}^{K} S_i(z) = S_1(z)^K. \tag{16.2}$$

Now, it is intuitive that all the eigenvalues of $M_K$ will behave for large $K$ as $\mu^K$, where $\mu$ is itself a random variable which we will characterize below. We take this as an assumption and indeed show that the distribution of $\mu$'s tends to a well-defined function $\rho_\infty(\mu)$ as $K \to \infty$. Note here a crucial difference with the case of sums of random matrices. If we assume that the eigenvalues of a sum of $K$ free random matrices behave as $K \times \mu$, one can easily establish that the distribution of $\mu$'s collapses for large $K$ to $\delta(\mu - \tau(A))$, with, once again, $\tau(.) = \operatorname{Tr}(.)/N$. For products of random matrices, on the other hand, the distribution of $\mu$ remains non-trivial, as we will find below.

Let us compute $S_{M_K}(z)$ in the large $K$ limit using our ansatz that the eigenvalues of $M_K$ are indeed of the form $\mu^K$. We first compute the function $t_K(z)$ equal to

$$t_K(z) = \int \frac{\mu^K}{z - \mu^K}\,\rho_\infty(\mu)\,d\mu = -\int \frac{\rho_\infty(\mu)}{1 - z\mu^{-K}}\,d\mu. \tag{16.3}$$

Setting $z := u^K$, we see that for $K \to \infty$ there is no contribution to this integral from the region $\mu < u$, whereas the region $\mu > u$ simply yields

$$t_K(z) \approx -P_>(z^{1/K}); \qquad P_>(u) := \int_u^\infty \rho_\infty(\mu)\,d\mu. \tag{16.4}$$

The next step to get the S-transform is to compute the functional inverse of $t_K(z)$. Within the same approximation, this is given by

$$t_K^{(-1)}(z) = \left[P_>^{(-1)}(-z)\right]^K, \tag{16.5}$$

where $P_>^{(-1)}$ is the functional inverse of the cumulative distribution function $P_>$. Finally, by definition,

$$S_{M_K}(z) = \frac{1 + z}{z\,t_K^{(-1)}(z)} = S_1(z)^K. \tag{16.6}$$

Hence one finds, in the large $K$ limit where $((1 + z)/z)^{1/K} \to 1$,

$$P_>^{(-1)}(-z) = \frac{1}{S_1(z)} \;\Rightarrow\; P_>(u) = -S_1^{(-1)}\left(\frac{1}{u}\right), \tag{16.7}$$

and finally $\rho_\infty(\mu) = -P_>'(\mu)$. The final result is therefore quite simple, and entirely depends on the S-transform of $A_i A_i^T$.

A simple case is when $A_i A_i^T$ is a large Wishart matrix, with parameter $q < 1$. In this case $S_1(z) = (1 + qz)^{-1}$, from which one easily works out that $\rho_\infty(\mu) = 1/q$ for $\mu \in (1 - q, 1)$ and zero elsewhere (see Fig. 16.1 for an illustration).

In many cases of interest, the eigenvalue spectrum of $A_i A_i^T$ has some symmetries, coming from the underlying physical problem one is interested in. For example, when our chaotic system is invariant under time reversal (like the dynamics of a Hamiltonian system), each eigenvalue $\lambda$ must come with its inverse $\lambda^{-1}$. A simple example of a spectrum with such a symmetry is the free log-normal, further discussed in the next section. It is defined from its S-transform, given by

$$S^0_{\mathrm{LN}}(z) = e^{-a(z + 1/2)}, \tag{16.8}$$
Figure 16.1 Sample density of $\mu = \lambda^{1/K}$ for the free product of $K = 40$ white Wishart matrices with $q = 1/2$ and $N = 1000$. The dark line corresponds to the asymptotic density ($K \to \infty$), which is constant between $1 - q$ and $1$ and zero elsewhere. The two dashed vertical lines give the exact positions of the edges of the spectrum ($\mu_- = 0.44$ and $\mu_+ = 1.10$) for $K = 40$, as computed in Exercise 16.1.1.


where the parameter $a$ is related to the trace of the corresponding matrices equal to $e^{a/2}$. Multiplying $K$ such matrices together leads to eigenvalues of the form $\mu^K$, with

$$P_>(\mu) = -\left(S^0_{\mathrm{LN}}\right)^{(-1)}\left(\frac{1}{\mu}\right) = \frac{1}{2} - \frac{\log\mu}{a}, \tag{16.9}$$

corresponding to

$$\rho_\infty(\mu) = -P_>'(\mu) = \frac{1}{a\mu}, \qquad \mu \in (e^{-a/2}, e^{a/2}), \tag{16.10}$$

and zero elsewhere. One can explicitly check that $\mu^{-1}$ has the same probability distribution function as $\mu$.
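The recipe $P_>(u) = -S_1^{(-1)}(1/u)$ followed by $\rho_\infty = -P_>'$ can be applied numerically to any S-transform that is monotonic on $[-1, 0]$. The following sketch inverts $S_1$ by bisection and differentiates by finite differences; both are implementation choices made here, and the function names are hypothetical.

```python
import math

def survival_from_s(s1, u, tol=1e-12):
    """P_>(u) = -S_1^{(-1)}(1/u), by bisection on z in [-1, 0].

    Assumes s1 is decreasing on [-1, 0]; returns 0 above the support and 1 below it.
    """
    target = 1.0 / u
    if target <= s1(0.0):
        return 0.0
    if target >= s1(-1.0):
        return 1.0
    lo, hi = -1.0, 0.0                       # invariant: s1(lo) > target > s1(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s1(mid) > target:
            lo = mid
        else:
            hi = mid
    return -0.5 * (lo + hi)

def limiting_density(s1, u, h=1e-5):
    """rho_infinity(u) = -dP_>/du by a centered finite difference."""
    return -(survival_from_s(s1, u + h) - survival_from_s(s1, u - h)) / (2.0 * h)

wishart = lambda z: 1.0 / (1.0 + 0.5 * z)            # white Wishart with q = 1/2
log_normal = lambda z: math.exp(-2.0 * (z + 0.5))    # symmetric free log-normal with a = 2

print(limiting_density(wishart, 0.75))     # close to 1/q = 2
print(limiting_density(log_normal, 1.3))   # close to 1/(a mu) = 1/2.6
```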

One often describes the eigenvalues of large products of random matrices in terms of the Lyapunov exponents $\Lambda$, defined as the eigenvalues of

$$\boldsymbol{\Lambda} = \lim_{K \to \infty}\frac{1}{K}\log M_K. \tag{16.11}$$

Therefore the Lyapunov exponents are simply related to the $\mu$'s above as $\Lambda = \log\mu$. For the free log-normal example, the distribution of $\Lambda$ is found to be uniform between $-a/2$ and $a/2$.

Let us end this section with an important remark: we have up to now considered products of $K$ matrices with a fixed spectrum, independent of $K$, which leads to a non-universal distribution of Lyapunov exponents (i.e. a distribution that explicitly depends on the full function $S_1(z)$). Let us now instead assume that these matrices are of the form

$$A A^T = \left(1 + \frac{a}{2K}\right)\mathbf{1} + \frac{B}{\sqrt{K}}, \tag{16.12}$$

where $a$ is a parameter and $B$ is traceless and characterized by its second cumulant $b = \tau(B^2)$. For large $K$, $S_1(z)$ can then be expanded as

$$S_1(z) = 1 - \frac{a}{2K} - \frac{b}{K}z + o(K^{-1}). \tag{16.13}$$

Therefore, for large $K$, the product of such matrices converges to a matrix characterized by

$$S_{M_K}(z) = \left(1 - \frac{a}{2K} - \frac{b}{K}z\right)^K \to e^{-a/2 - bz}, \tag{16.14}$$

which can be interpreted as a multiplicative CLT for free matrices, since the detailed statistics of $B$ has disappeared. The choice $b = a$ corresponds to the free log-normal with inversion symmetry $S^0_{\mathrm{LN}}$ (see next section).


Exercise 16.1.1 Edges of the spectrum for the free product of many white Wishart matrices
In this exercise, we will compute the edges of the spectrum of eigenvalues of a matrix $M$ given by the free product of $K$ large white Wishart matrices with parameter $q$.

(a) The S-transform of $M$ is simply given by the S-transform of a white Wishart raised to the power $K$. Using Eq. (11.92), write an equation for the inverse of the T-transform, $\zeta(t)$, of the matrix $M$. This is a polynomial equation of order $K + 1$.

(b) For odd $K$, plot $\zeta(t)$ for various $K$ and $0 < q < 1$ and convince yourself that there is always a region of $\zeta$ where $\zeta(t) = \zeta$ has no real solution. This region is between a local maximum and a local minimum of $\zeta(t)$. For even $K$, the argument is more subtle, but the correct branch exists only between the same two extrema.

(c) Differentiate $\zeta(t)$ with respect to $t$ to find an equation for the extrema of $\zeta(t)$. After simplifications and discarding the $t = -1/q$ solution, this equation is quadratic in $t^*$ with two solutions corresponding to the local minimum and maximum. Find the two solutions $t^*_\pm$ and plug these back in your equation for $\zeta(t)$ to find the edges of the spectrum $\lambda_\pm$.

(d) Use your result for $K = 40$ and $q = 0.5$ to verify the edges of the spectrum given in Figure 16.1.

(e) Compute the large $K$ limit of $t^*_\pm$. You should find $t^*_- \to -1$ and $t^*_+ \to (q(K - 1))^{-1}$. Show that at large $K$ we have $\lambda_-^{1/K} \to 1 - q$ and $\lambda_+^{1/K} \to 1$.
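A numerical check for part (d) might look as follows. It assumes $\zeta(t) = (t + 1)(1 + qt)^K/t$, obtained here from $S(t) = (1 + qt)^{-K}$ and $\zeta(t) = (t + 1)/(tS(t))$, and the quadratic $Kq\,t^2 + (K - 1)q\,t - 1 = 0$ for the extrema, which is one way to write the condition of part (c) and should be verified independently. The function names are hypothetical.

```python
import math

def zeta(t, q, k):
    """Inverse T-transform zeta(t) = (t + 1) / (t S(t)) with S(t) = (1 + q t)^(-K)."""
    return (t + 1.0) * (1.0 + q * t) ** k / t

def spectrum_edges(q, k):
    """Edges lambda_-, lambda_+ from the two extrema of zeta(t)."""
    # d(log zeta)/dt = 0 reduces to K q t^2 + (K - 1) q t - 1 = 0.
    a, b, c = k * q, (k - 1) * q, -1.0
    disc = math.sqrt(b * b - 4.0 * a * c)
    t_minus, t_plus = (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)
    return zeta(t_minus, q, k), zeta(t_plus, q, k)

lam_minus, lam_plus = spectrum_edges(0.5, 40)
print(lam_minus ** (1.0 / 40), lam_plus ** (1.0 / 40))   # compare with Figure 16.1
```

For $K = 1$ the same function returns the Marčenko-Pastur edges $(1 \pm \sqrt{q})^2$, which is a convenient consistency check.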
16.2 The Free Log-Normal
There exists a free version of the log-normal. Its S-transform is given by

$$S_{\mathrm{LN}}(t) = e^{-a/2 - bt}. \tag{16.15}$$

As a two-parameter family, the free log-normal is stable in the sense that the free product of two free log-normals with parameters $a_1, b_1$ and $a_2, b_2$ is a free log-normal with parameters $a = a_1 + a_2$, $b = b_1 + b_2$. The first three free cumulants can be computed from Eq. (16.15):

$$S_{\mathrm{LN}}(t) = e^{-a/2}\left[1 - bt + \frac{b^2 t^2}{2}\right] + O(t^3). \tag{16.16}$$

Comparing with Eq. (15.11), this leads to

$$\kappa_1 = e^{a/2}, \qquad \kappa_2 = b\,e^{a}, \qquad \kappa_3 = \frac{3b^2}{2}\,e^{3a/2}. \tag{16.17}$$

In the special case $b = a$, the free log-normal $S^0_{\mathrm{LN}}$ has the additional property that its matrix inverse has exactly the same law. Indeed, we have shown in Section 11.4.4 that the following general relation holds:

$$S_{M^{-1}}(t) = \frac{1}{S_M(-t - 1)}, \tag{16.18}$$

or, in the free log-normal case,

$$S_{M^{-1}}(t) = e^{a/2 - b(1 + t)} = S_M(t) \tag{16.19}$$

when $b = a$. This implies that the eigenvalue distribution is invariant under $\lambda \to 1/\lambda$ and therefore that $M$ has unit determinant. Let us study in more detail the eigenvalue spectrum for the symmetric case $a = b$. By looking for the real extrema of

$$\zeta(t) = \frac{t + 1}{t}\,e^{a(t + 1/2)}, \tag{16.20}$$

we can find the points $t_\pm$ where $t(\zeta)$ ceases to be invertible, which in turn give the edges of the spectrum $\lambda_\pm = \zeta(t_\pm)$:

$$t_\pm = \frac{\pm\sqrt{1 + 4/a} - 1}{2} \tag{16.21}$$

or

$$\lambda_\pm = \left[\sqrt{1 + \frac{a}{4}} \pm \sqrt{\frac{a}{4}}\right]^2 \exp\left[\pm\sqrt{a\left(1 + \frac{a}{4}\right)}\right]. \tag{16.22}$$

Note that $\lambda_+ = \lambda_- = 1$ when $a = b = 0$, corresponding to the identity matrix. The eigenvalue distribution is symmetric in $\lambda \to 1/\lambda$ so the density $\rho(\ell)$ of $\ell = \log(\lambda)$ is even. Figure 16.2 shows the density of $\ell$ for $a = 100$.
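The edges of the symmetric free log-normal can be evaluated in two ways and compared: by inserting the extrema $t_\pm = (\pm\sqrt{1 + 4/a} - 1)/2$ into $\zeta(t) = \frac{t+1}{t}e^{a(t+1/2)}$, or from the closed form $\lambda_\pm = [\sqrt{1 + a/4} \pm \sqrt{a/4}]^2\exp[\pm\sqrt{a(1 + a/4)}]$. The sketch below does both; the function names are hypothetical.

```python
import math

def zeta_ln(t, a):
    """zeta(t) = (t + 1)/t * exp(a (t + 1/2)) for the symmetric free log-normal (b = a)."""
    return (t + 1.0) / t * math.exp(a * (t + 0.5))

def edges_from_extrema(a):
    """lambda_-, lambda_+ as zeta(t_-), zeta(t_+) at the two real extrema."""
    s = math.sqrt(1.0 + 4.0 / a)
    return zeta_ln(0.5 * (-s - 1.0), a), zeta_ln(0.5 * (s - 1.0), a)

def edges_closed_form(a):
    """Closed-form edges [sqrt(1 + a/4) -/+ sqrt(a/4)]^2 exp(-/+ sqrt(a (1 + a/4)))."""
    u, v = math.sqrt(1.0 + a / 4.0), math.sqrt(a / 4.0)
    w = math.sqrt(a * (1.0 + a / 4.0))
    return (u - v) ** 2 * math.exp(-w), (u + v) ** 2 * math.exp(w)

for a in (1.0, 100.0):
    lo, hi = edges_from_extrema(a)
    print(a, math.log(lo), math.log(hi), lo * hi)   # log-edges are opposite, product is 1
```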
Figure 16.2 Probability density of $\ell = \log(\lambda)$ for a symmetric free log-normal (16.15) with $a = b = 100$ compared with a Wigner semi-circle with the same endpoints. As expected, the distribution is even in $\ell$. For $a < 1$ the density of $\ell$ is indistinguishable to the eye from a Wigner semi-circle (not shown), whereas for $a \to \infty$ the distribution of $\ell/a$ tends to a uniform distribution on $[-1/2, 1/2]$.


In the more general case $a \neq b$, the whole distribution of $\ell = \log(\lambda)$ is just shifted by $(a - b)/2$, as expected from the scaling property of the S-transform upon multiplication by a scalar.


16.3 A Multiplicative Dyson Brownian Motion 


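Let us now consider the problem of multiplying random matrices close to unity from a slightly different angle. Consider the following iterative construction for $N \times N$ matrices:

$$M_{n+1} = M_n^{1/2}\left[\left(1 + \frac{a\varepsilon}{2}\right)\mathbf{1} + \sqrt{\varepsilon}\,B_n\right]M_n^{1/2}, \tag{16.23}$$

where $B_n$ is a sequence of identical, free, traceless $N \times N$ matrices and $\varepsilon \ll 1$. Using second order perturbation theory, one can deduce an iteration formula for the eigenvalues $\lambda_{i,n}$ of $M_n$, which reads

$$\lambda_{i,n+1} = \lambda_{i,n}\left(1 + \frac{a\varepsilon}{2} + \sqrt{\varepsilon}\,v_{i,n}^T B_n v_{i,n}\right) + \varepsilon\sum_{j \neq i}\frac{\lambda_{i,n}\lambda_{j,n}\left(v_{i,n}^T B_n v_{j,n}\right)^2}{\lambda_{i,n} - \lambda_{j,n}}, \tag{16.24}$$

where $v_{i,n}$ are the corresponding eigenvectors. Noting that $M_n$ and $B_n$ are mutually free and that $\tau(B_n) = 0$, one has, in the large $N$ limit (using, for example, Eq. (12.8)),

$$\mathbb{E}\left[v_{i,n}^T B_n v_{i,n}\right] = 0; \qquad \mathbb{E}\left[\left(v_{i,n}^T B_n v_{j,n}\right)^2\right] = \frac{b}{N}, \tag{16.25}$$

where $b := \tau(B^2)$. Choosing $\varepsilon = dt$, an infinitesimal time scale, we end up with a multiplicative version of the Dyson Brownian motion in fictitious time $t$:

$$\frac{d\lambda_i}{dt} = \frac{a}{2}\lambda_i + \frac{b}{N}\sum_{j \neq i}\frac{\lambda_i\lambda_j}{\lambda_i - \lambda_j} + \sqrt{\frac{b}{N}}\,\lambda_i\,\xi_i, \tag{16.26}$$

where $\xi_i$ is a Langevin noise, independent for each $\lambda_i$ (compare with Eq. (9.9)).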
Now, let us consider the "time" dependent Stieltjes transform, defined as usual as

$$g(z, t) = \frac{1}{N}\sum_i \frac{1}{z - \lambda_i(t)}. \tag{16.27}$$

Its evolution is obtained as

$$\frac{\partial g}{\partial t} = \frac{1}{N}\sum_i \frac{1}{(z - \lambda_i)^2}\frac{d\lambda_i}{dt} = -\frac{1}{N}\frac{\partial}{\partial z}\sum_i \frac{1}{z - \lambda_i}\frac{d\lambda_i}{dt}. \tag{16.28}$$

After manipulations very similar to those encountered in Section 9.3.1, and retaining only leading terms in $N$, one finally obtains

$$\frac{\partial g}{\partial t} = \frac{1}{2}\frac{\partial}{\partial z}\left[(2b - a)zg - bz^2g^2\right]. \tag{16.29}$$

Now, introduce the auxiliary function $h(\ell, t) := e^\ell g(e^\ell, t) + a/(2b) - 1$, which obeys

$$\frac{\partial h}{\partial t} = -bh\frac{\partial h}{\partial \ell}. \tag{16.30}$$

This is precisely the Burgers' equation (9.37), up to a rescaling of time $t \to bt$. Its solution obeys the following self-consistent equation obtained using the method of characteristics (see Section 10.1):

$$h(\ell, t) = h_0(\ell - bt\,h(\ell, t)); \qquad h_0(\ell) := h(\ell, 0) = \frac{1}{1 - e^{-\ell}} + \frac{a}{2b} - 1, \tag{16.31}$$

where we have assumed that at time $t = 0$ the dynamics starts from the identity matrix: $M_0 = \mathbf{1}$, for which $g(z, 0) = (z - 1)^{-1}$. Hence, with $z = e^\ell$,

$$g(z, t) = \frac{1}{z - e^{t(bzg(z,t) + a/2 - b)}}. \tag{16.32}$$

Now, let us compare this equation to the one obeyed by the Stieltjes transform of the free log-normal. Injecting $t = zg_{\mathrm{LN}} - 1$ in

$$z = \frac{t + 1}{tS_{\mathrm{LN}}(t)} \tag{16.33}$$

and using Eq. (16.15), one finds

$$zg_{\mathrm{LN}} - 1 = g_{\mathrm{LN}}\,e^{b(zg_{\mathrm{LN}} - 1) + a/2} \;\Rightarrow\; g_{\mathrm{LN}} = \frac{1}{z - e^{bzg_{\mathrm{LN}} + a/2 - b}}, \tag{16.34}$$

which coincides with Eq. (16.32) for $t = 1$, as it should. For arbitrary times, one finds that the density corresponding to the multiplicative Dyson Brownian motion, Eq. (16.26), is the free log-normal, with parameters $ta$ and $tb$.
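The iteration (16.23) is easy to simulate directly. The sketch below assumes that independent traceless Gaussian (Wigner) matrices are an adequate model for the free matrices $B_n$, and compares the first two moments at time $t = n\varepsilon$ with those of a free log-normal of parameters $(ta, tb)$, namely $e^{ta/2}$ and $(1 + tb)e^{ta}$ from $\kappa_1 = e^{a/2}$ and $\kappa_2 = b\,e^{a}$. Function names and parameter values are illustrative.

```python
import numpy as np

def traceless_wigner(n, b, rng):
    """Symmetric Gaussian matrix with tau(B) = 0 and tau(B^2) close to b."""
    h = rng.standard_normal((n, n))
    m = np.sqrt(b) * (h + h.T) / np.sqrt(2.0 * n)
    return m - np.trace(m) / n * np.eye(n)

def multiplicative_dbm(n, a, b, eps, n_steps, rng):
    """Iterate Eq. (16.23) from M_0 = 1 and return the eigenvalues of the final matrix."""
    m = np.eye(n)
    for _ in range(n_steps):
        vals, vecs = np.linalg.eigh(m)
        root = (vecs * np.sqrt(vals)) @ vecs.T          # symmetric square root of M_n
        step = (1.0 + 0.5 * a * eps) * np.eye(n) + np.sqrt(eps) * traceless_wigner(n, b, rng)
        m = root @ step @ root
        m = 0.5 * (m + m.T)                             # remove round-off asymmetry
    return np.linalg.eigvalsh(m)

rng = np.random.default_rng(5)
a, b, eps, n_steps = 0.5, 0.5, 0.01, 100                # total time t = 1
eigs = multiplicative_dbm(120, a, b, eps, n_steps, rng)
print(np.mean(eigs), np.exp(a / 2))                     # first moment
print(np.mean(eigs ** 2), (1 + b) * np.exp(a))          # second moment
print(np.mean(np.log(eigs)))                            # close to 0 in the symmetric case a = b
```

The agreement is limited by the finite step $\varepsilon$ and the finite size $N$, so deviations of a few percent are expected.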
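16.4 The Matrix Kesten Problem


The Kesten iteration for scalar random variables appears in many different situations. It is defined by

$$Z_{n+1} = z_n(1 + Z_n), \tag{16.35}$$

where $z_n$ are IID random variables. In the following, we will assume that

$$z_n = 1 + \varepsilon m + \sqrt{\varepsilon}\,\sigma\eta_n, \tag{16.36}$$

where $\varepsilon \ll 1$ and $\eta_n$ are IID random variables, of zero mean and unit variance. Setting $Z_n = U_n/\varepsilon$ and expanding to first order in $\varepsilon$, one obtains

$$U_{n+1} = \varepsilon\left(1 + \varepsilon m + \sqrt{\varepsilon}\,\sigma\eta_n\right)\left(1 + \frac{U_n}{\varepsilon}\right) = U_n + \varepsilon m U_n + \sqrt{\varepsilon}\,\sigma\eta_n U_n + \varepsilon \tag{16.37}$$

or, in the continuous time limit $dt = \varepsilon$, the following Langevin equation:

$$\frac{dU}{dt} = 1 + mU + \sigma\eta U. \tag{16.38}$$

The corresponding Fokker-Planck equation reads

$$\frac{\partial P(U, t)}{\partial t} = -\frac{\partial}{\partial U}\left[(1 + mU)P\right] + \frac{\sigma^2}{2}\frac{\partial^2}{\partial U^2}\left[U^2 P\right]. \tag{16.39}$$

This process has a stationary distribution provided the drift $m$ is negative. We will thus write $m = -\hat{m}$ with $\hat{m} > 0$. The corresponding stationary distribution $P_{\mathrm{eq}}(U)$ obeys

$$(1 - \hat{m}U)P_{\mathrm{eq}} = \frac{\sigma^2}{2}\frac{\partial}{\partial U}\left[U^2 P_{\mathrm{eq}}\right], \tag{16.40}$$

which leads to

$$P_{\mathrm{eq}}(U) = \frac{2^\mu}{\Gamma(\mu)\,\sigma^{2\mu}}\,\frac{e^{-2/(\sigma^2 U)}}{U^{1+\mu}}, \qquad \mu := 1 + 2\hat{m}/\sigma^2, \tag{16.41}$$

to wit, the distribution of $U$ is an inverse-gamma, with a power-law tail $U^{-1-\mu}$ with a non-universal exponent $\mu = 1 + 2\hat{m}/\sigma^2$.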
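A direct simulation of the scalar iteration gives a feel for the stationary state. The sketch below assumes Gaussian $\eta_n$ and compares the sample mean of $U = \varepsilon Z$ with the mean $b/(a - 1)$ of an inverse-gamma law with shape $a = \mu$ and scale $b = 2/\sigma^2$, which equals $1/\hat{m}$. The function names and the parameter values (chosen so that the variance is finite) are illustrative.

```python
import numpy as np

def simulate_kesten(m, sigma, eps, n_steps, n_paths, rng):
    """Iterate Z_{n+1} = z_n (1 + Z_n) with z_n = 1 + eps m + sqrt(eps) sigma eta_n; return U = eps Z."""
    z_big = np.zeros(n_paths)
    for _ in range(n_steps):
        eta = rng.standard_normal(n_paths)
        z_small = 1.0 + eps * m + np.sqrt(eps) * sigma * eta
        z_big = z_small * (1.0 + z_big)
    return eps * z_big

def inverse_gamma_mean(m_hat, sigma):
    """Mean b/(a - 1) of the inverse-gamma law with a = mu = 1 + 2 m_hat/sigma^2 and b = 2/sigma^2."""
    mu = 1.0 + 2.0 * m_hat / sigma ** 2
    return (2.0 / sigma ** 2) / (mu - 1.0)

rng = np.random.default_rng(6)
u = simulate_kesten(m=-2.0, sigma=1.0, eps=0.01, n_steps=1000, n_paths=5000, rng=rng)
print(u.mean(), inverse_gamma_mean(2.0, 1.0))
```

With a finite step $\varepsilon$ the sample mean sits slightly below $1/\hat{m}$; the gap closes as $\varepsilon \to 0$.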
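Now we can generalize the Kesten iteration for symmetric matrices as

$$U_{n+1} = \varepsilon\sqrt{\mathbf{1} + \frac{U_n}{\varepsilon}}\left((1 + m\varepsilon)\mathbf{1} + \sqrt{\varepsilon}\,\sigma B\right)\sqrt{\mathbf{1} + \frac{U_n}{\varepsilon}}, \tag{16.42}$$

or

$$U_{n+1} - U_n = \varepsilon(\mathbf{1} + mU_n) + \sigma\sqrt{\varepsilon}\,\sqrt{U_n}\,B\,\sqrt{U_n}. \tag{16.43}$$

Following the same steps as in the previous section, we obtain a differential equation for the eigenvalues of $U$ (where we neglect the noise when $N \to \infty$):

$$\frac{d\lambda_i}{dt} = 1 + m\lambda_i + \frac{\sigma^2}{N}\sum_{j \neq i}\frac{\lambda_i\lambda_j}{\lambda_i - \lambda_j}, \tag{16.44}$$

where we again assume that $m < 0$ in order to find a stationary state for our process. The corresponding evolution of the Stieltjes transform reads, for large $N$,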


! The results of this section have been obtained in collaboration with T. Gautié and P. Le Doussal. 
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ðg ə 


A 1 
Bf Oz E F (o? + m)z9 — T . (16.45) 


If an equilibrium density exists, its Stieltjes transform must obey

$$
\frac{1}{2}\sigma^2 z^2 g^2 + \left(1 - (\sigma^2 + \hat{m})z\right) g + C = 0, \tag{16.46}
$$

where $C$ is a constant determined by the fact that $zg \to 1$ when $z \to \infty$. Hence,

$$
C = \frac{1}{2}\sigma^2 + \hat{m}. \tag{16.47}
$$

From the second order equation on $g$ one gets

$$
g = \frac{1}{\sigma^2 z^2}\left[\left((\sigma^2 + \hat{m})z - 1\right) - \sqrt{\hat{m}^2 z^2 - 2(\sigma^2 + \hat{m})z + 1}\,\right]. \tag{16.48}
$$

As usual, the density of eigenvalues is non-zero when the square-root becomes imaginary.
The edges are thus given by the roots of the second degree polynomial inside the square-root,
namely

$$
\lambda_\pm = \frac{\sigma^2 + \hat{m} \pm \sqrt{\sigma^4 + 2\sigma^2 \hat{m}}}{\hat{m}^2}. \tag{16.49}
$$

So only when $\hat{m} \to 0$ can the spectrum extend to infinity, with a power-law decay as
$\lambda^{-3/2}$. Otherwise, the power law is truncated beyond $2\sigma^2/\hat{m}^2$. Note that, contrary to the
scalar Kesten case, the exponent of the power law is universal, with $\mu = 1/2$.

In fact, if one stares at Eq. (16.48), one realizes that the stationary Kesten matrix $\mathbf{U}$ is an
inverse-Wishart matrix. Indeed, the eigenvalue spectrum given by Eq. (16.48) maps into
the Marčenko–Pastur law, Eq. (4.43), provided one makes the following transformation:

$$
\lambda \to x = \frac{2}{\sigma^2 + 2\hat{m}}\,\frac{1}{\lambda}. \tag{16.50}
$$

The parameter $q$ of the Marčenko–Pastur law is then given by

$$
q = \frac{\sigma^2}{\sigma^2 + 2\hat{m}} = \frac{1}{\mu} < 1. \tag{16.51}
$$

Although not trivial, this result is not so surprising: since Wishart matrices are the matrix 
equivalent of the scalar gamma distribution, the matrix equivalent of the Kesten variable 
distributed as an inverse-gamma, Eq. (16.41), is an inverse-Wishart. 
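The mapping onto the Marčenko–Pastur law can be checked on the spectrum edges. The sketch below, with hypothetical function names, computes the edges $\lambda_\pm$ of Eq. (16.49), applies the transformation (16.50), and compares the result with the standard Marčenko–Pastur edges $(1 \pm \sqrt{q})^2$ for the value of $q$ given in Eq. (16.51). It assumes $\hat{m} > 0$.

```python
import math


def kesten_edges(m_hat, sigma):
    """Edges (lambda_minus, lambda_plus) of the stationary spectrum, Eq. (16.49)."""
    root = math.sqrt(sigma**4 + 2.0 * sigma**2 * m_hat)
    lam_minus = (sigma**2 + m_hat - root) / m_hat**2
    lam_plus = (sigma**2 + m_hat + root) / m_hat**2
    return lam_minus, lam_plus


def kesten_q(m_hat, sigma):
    """Marcenko-Pastur parameter q of Eq. (16.51)."""
    return sigma**2 / (sigma**2 + 2.0 * m_hat)


def mapped_mp_edges(m_hat, sigma):
    """Images of the Kesten edges under x = 2 / ((sigma^2 + 2 m_hat) lambda)."""
    lam_minus, lam_plus = kesten_edges(m_hat, sigma)
    scale = 2.0 / (sigma**2 + 2.0 * m_hat)
    # The map is decreasing, so the upper Kesten edge gives the lower MP edge.
    return scale / lam_plus, scale / lam_minus


def mp_edges(q):
    """Edges of the Marcenko-Pastur law with parameter q < 1."""
    return (1.0 - math.sqrt(q)) ** 2, (1.0 + math.sqrt(q)) ** 2
```

For instance, with `m_hat = 0.5` and `sigma = 1.0` one has $q = 1/2$, and `mapped_mp_edges` agrees with `mp_edges(0.5)` up to rounding.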
Sample Covariance Matrices


In this chapter, we will show how to compute the various transforms ($S(t)$, $t(z)$, $g(z)$) for
sample covariance matrices (SCM) when the data has non-trivial true correlations, i.e. is
characterized by a non-diagonal true underlying covariance matrix $\mathbf{C}$ and possibly non-trivial
temporal correlations as well. More precisely, $N$ time series of length $T$ are stored
in a rectangular $N \times T$ matrix $\mathbf{H}$. The sample covariance matrix is defined as

$$
\mathbf{E} = \frac{1}{T}\mathbf{H}\mathbf{H}^{\top}. \tag{17.1}
$$

If the $N$ time series are stationary, we expect that for $T \gg N$, the SCM $\mathbf{E}$ converges to the
“true” covariance matrix $\mathbf{C}$. The non-trivial correlations encoded in the off-diagonal elements
of $\mathbf{C}$ are what we henceforth call spatial (or cross-sectional) correlations. But the $T$
samples might also be non-independent and we will also model these temporal correlations.
Of course, the data might have both types of correlations (spatial and temporal).

We will be interested in the eigenvalues $\{\lambda_k\}$ of $\mathbf{E}$ and their density $\rho_{\mathbf{E}}(\lambda)$, which we
will compute from the knowledge of its Stieltjes transform $g_{\mathbf{E}}(z)$ using Eq. (2.47). We can
also compute the singular values $\{s_k\}$ of $\mathbf{H}$; note that these singular values are related to the
eigenvalues of $\mathbf{E}$ via $s_k = \sqrt{T\lambda_k}$.


17.1 Spatial Correlations 


Consider the case where $\mathbf{H}$ are multivariate Gaussian observations, drawn from $\mathcal{N}(0, \mathbf{C})$.
We saw in Section 4.2.4 that $\mathbf{E}$ is then a general Wishart matrix with column covariance $\mathbf{C}$,
and can be written as

$$
\mathbf{E} = \mathbf{C}^{1/2}\,\mathbf{W}_q\,\mathbf{C}^{1/2}. \tag{17.2}
$$

We recognize this formula as the free product of the covariance matrix $\mathbf{C}$ and a white
Wishart of parameter $q$, $\mathbf{W}_q$. Note that since the white Wishart is rotationally invariant, it
is free from any matrix $\mathbf{C}$. From the multiplicativity of the S-transform and the form of the
S-transform of the white Wishart (Eq. (15.21)), we have

$$
S_{\mathbf{E}}(t) = \frac{S_{\mathbf{C}}(t)}{1 + qt}. \tag{17.3}
$$
We can also use the subordination relation of the free product (Eq. (11.109)) to write this
relation in terms of T-transforms:

$$
t_{\mathbf{E}}(z) = t_{\mathbf{C}}(Z(z)), \qquad Z(z) = \frac{z}{1 + q\,t_{\mathbf{E}}(z)}. \tag{17.4}
$$

This last expression can be written in terms of the more familiar Stieltjes transform using
$t(z) = z g(z) - 1$:

$$
z g_{\mathbf{E}}(z) = Z g_{\mathbf{C}}(Z), \quad\text{where}\quad Z = \frac{z}{1 - q + q z g_{\mathbf{E}}(z)}. \tag{17.5}
$$

This is the central equation that allows one to infer the “true” spectral density of $\mathbf{C}$, $\rho_{\mathbf{C}}(\lambda)$,
from the empirically observed spectrum of $\mathbf{E}$. Note that this equation can equivalently be
rewritten in terms of the spectral density of $\mathbf{C}$ as

$$
g_{\mathbf{E}}(z) = \int \frac{\rho_{\mathbf{C}}(\mu)\,\mathrm{d}\mu}{z - \mu\left(1 - q + q z g_{\mathbf{E}}(z)\right)}. \tag{17.6}
$$


We will see in Chapter 20 some real world applications of this formula. One of the most
important properties of Eq. (17.5) is its universality: it holds (in the large $N$ limit) much
beyond the restricted perimeter of multivariate Gaussian observations $\mathbf{H}$. In fact, as soon
as the observations have a finite second moment, the relation between the “true” spectral
density $\rho_{\mathbf{C}}$ and the empirical Stieltjes transform $g_{\mathbf{E}}(z)$ is given by Eq. (17.5).

Let us discuss some interesting limiting cases. First, when $q \to 0$, i.e. when $T \gg N$,
one expects that $\mathbf{E} \approx \mathbf{C}$. This is indeed what one finds since in that limit $Z = z + O(q)$;
hence $g_{\mathbf{E}}(z) = g_{\mathbf{C}}(z)$ and $\rho_{\mathbf{E}} = \rho_{\mathbf{C}}$.

Second, consider the case $\mathbf{C} = \mathbf{1}$, for which $g_{\mathbf{C}}(Z) = 1/(Z - 1)$. We thus obtain

$$
z g_{\mathbf{E}}(z) = \frac{Z}{Z - 1} = \frac{z}{z - 1 + q - q z g_{\mathbf{E}}(z)} \quad\Rightarrow\quad \frac{1}{g_{\mathbf{E}}(z)} = z - 1 + q - q z g_{\mathbf{E}}(z), \tag{17.7}
$$

which coincides with Eq. (4.37). In the next exercise, we consider the case where $\mathbf{C}$ is an
inverse-Wishart matrix, in which case some explicit results can be obtained.

We can also infer some properties of the spectrum of $\mathbf{E}$ using the moment generating
function. The T-transform of $\mathbf{E}$ can be expressed as the following power series for $z \to \infty$:
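$$
t_{\mathbf{E}}(z) \underset{z \to \infty}{\longrightarrow} \sum_{k=1}^{\infty} \tau(\mathbf{E}^k)\, z^{-k}. \tag{17.8}
$$

We thus deduce that

$$
Z(z) \underset{z \to \infty}{\longrightarrow} \frac{z}{1 + q\sum_{\ell=1}^{\infty} \tau(\mathbf{E}^\ell)\, z^{-\ell}}.
$$

Therefore we have, for $z \to \infty$,

$$
t_{\mathbf{C}}(Z(z)) \longrightarrow \sum_{k=1}^{\infty} \frac{\tau(\mathbf{C}^k)}{z^k}\left(1 + q\sum_{\ell=1}^{\infty}\tau(\mathbf{E}^\ell)\, z^{-\ell}\right)^{k}. \tag{17.9}
$$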
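Hence, one can relate the moments of $\rho_{\mathbf{E}}$ with the moments of $\rho_{\mathbf{C}}$ by taking $z \to \infty$
in Eq. (17.4), namely

$$
\sum_{k=1}^{\infty}\frac{\tau(\mathbf{E}^k)}{z^k} = \sum_{k=1}^{\infty}\frac{\tau(\mathbf{C}^k)}{z^k}\left(1 + q\sum_{\ell=1}^{\infty}\frac{\tau(\mathbf{E}^\ell)}{z^\ell}\right)^{k}. \tag{17.10}
$$

In particular, with the normalization $\tau(\mathbf{C}) = 1$, Eq. (17.10) yields the first three moments of $\rho_{\mathbf{E}}$:

$$
\begin{aligned}
\tau(\mathbf{E}) &= \tau(\mathbf{C}), \\
\tau(\mathbf{E}^2) &= \tau(\mathbf{C}^2) + q, \\
\tau(\mathbf{E}^3) &= \tau(\mathbf{C}^3) + 3q\,\tau(\mathbf{C}^2) + q^2.
\end{aligned}
\tag{17.11}
$$

We thus see that the mean of $\mathbf{E}$ is equal to that of $\mathbf{C}$, whereas the variance of $\mathbf{E}$ is equal to
that of $\mathbf{C}$ plus $q$. As expected, the spectrum of the sample covariance matrix $\mathbf{E}$ is always
wider (for $q > 0$) than the spectrum of the population covariance matrix $\mathbf{C}$.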
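The order-by-order matching of powers of $1/z$ in Eq. (17.10) can be automated. The following sketch (the helper names `poly_mul` and `scm_moments` are made up for this illustration) treats both sides as truncated power series in $w = 1/z$ and solves for $\tau(\mathbf{E}^k)$ recursively with exact rational arithmetic. The coefficient of $w^n$ on the right-hand side only involves $\tau(\mathbf{E}^\ell)$ with $\ell < n$, which is what makes the recursion possible.

```python
from fractions import Fraction


def poly_mul(a, b, degree):
    """Multiply two coefficient lists in w, dropping terms beyond w**degree."""
    out = [Fraction(0)] * (degree + 1)
    for i, ai in enumerate(a[: degree + 1]):
        if ai == 0:
            continue
        for j, bj in enumerate(b[: degree + 1 - i]):
            out[i + j] += ai * bj
    return out


def scm_moments(c_moments, q):
    """Return [tau(E), ..., tau(E^n)] given [tau(C), ..., tau(C^n)] and q.

    Implements the coefficient matching of Eq. (17.10) in powers of w = 1/z.
    """
    n = len(c_moments)
    q = Fraction(q)
    e = [Fraction(0)] * (n + 1)  # e[k] = tau(E^k); e[0] is unused
    for order in range(1, n + 1):
        # A(w) = 1 + q * sum_l tau(E^l) w^l, built from the moments known so far
        a = [Fraction(1)] + [q * e[l] for l in range(1, n + 1)]
        power = [Fraction(1)] + [Fraction(0)] * n  # A(w)**0
        total = Fraction(0)
        for k in range(1, order + 1):
            power = poly_mul(power, a, n)  # now A(w)**k
            total += Fraction(c_moments[k - 1]) * power[order - k]
        e[order] = total
    return e[1:]
```

With $\mathbf{C} = \mathbf{1}$ and $q = 1/2$, `scm_moments([1, 1, 1], Fraction(1, 2))` gives $1$, $3/2$ and $11/4$, in agreement with Eq. (17.11).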


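Another interesting expansion concerns the case where $q < 1$, such that $\mathbf{E}$ is invertible.
Hence $g_{\mathbf{E}}(z)$ is analytic for $z \to 0$ and one readily finds

$$
g_{\mathbf{E}}(z) \underset{z \to 0}{\longrightarrow} -\sum_{k=1}^{\infty}\tau(\mathbf{E}^{-k})\, z^{k-1}. \tag{17.12}
$$

This allows us to study the moments of $\mathbf{E}^{-1}$, which turn out to be important quantities for
many applications. Using Eq. (17.5), we can actually relate the moments of the spectrum of
$\mathbf{E}^{-1}$ to those of $\mathbf{C}^{-1}$. Indeed, for $z \to 0$,

$$
Z(z) = \frac{z}{1 - q - q\sum_{k=1}^{\infty}\tau(\mathbf{E}^{-k})\, z^{k}}.
$$

Hence, we obtain the following expansion:

$$
\sum_{k=1}^{\infty}\tau(\mathbf{E}^{-k})\, z^{k} = \sum_{k=1}^{\infty}\tau(\mathbf{C}^{-k})\left(\frac{z}{1 - q - q\sum_{\ell=1}^{\infty}\tau(\mathbf{E}^{-\ell})\, z^{\ell}}\right)^{k}. \tag{17.13}
$$

After a little work, we get

$$
\tau(\mathbf{E}^{-1}) = \frac{\tau(\mathbf{C}^{-1})}{1 - q}, \qquad \tau(\mathbf{E}^{-2}) = \frac{\tau(\mathbf{C}^{-2})}{(1 - q)^2} + \frac{q\,\tau(\mathbf{C}^{-1})^2}{(1 - q)^3}. \tag{17.14}
$$

We will discuss in Section 20.2.1 a direct application of these formulas: $\tau(\mathbf{E}^{-1})$ turns out
to be related to the “out-of-sample” risk of an optimized portfolio of financial instruments.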
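Eq. (17.14) is simple enough to wrap in a small helper, which makes the blow-up of the inverse moments as $q \to 1$ easy to explore. The function name and argument names below are illustrative only.

```python
def inverse_moments(tau_c_inv1, tau_c_inv2, q):
    """First two inverse moments of E from those of C, following Eq. (17.14).

    tau_c_inv1 and tau_c_inv2 stand for tau(C^-1) and tau(C^-2).
    The expansion requires 0 <= q < 1 so that E is invertible.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must satisfy 0 <= q < 1")
    first = tau_c_inv1 / (1.0 - q)
    second = tau_c_inv2 / (1.0 - q) ** 2 + q * tau_c_inv1**2 / (1.0 - q) ** 3
    return first, second
```

For $\mathbf{C} = \mathbf{1}$ the two terms of the second moment combine into $1/(1-q)^3$, so `inverse_moments(1.0, 1.0, 0.5)` returns `(2.0, 8.0)`.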


Exercise 17.1.1 The exponential moving average sample covariance matrix (EMA-SCM)

Instead of measuring the sample covariance matrix using a flat average over
a fixed time window $T$, one can compute the average using an exponential
weighted moving average. Let us compute the spectrum of such a matrix in the
null case of IID data. Imagine we have an infinite time series of vectors of size $N$,
$\{\mathbf{x}_{t'}\}$, for $t'$ from minus infinity to now. We define the EMA-SCM (on time scale $\tau_c$) as

$$
\mathbf{E}(t) = \gamma_c \sum_{t'=-\infty}^{t} (1 - \gamma_c)^{t - t'}\, \mathbf{x}_{t'}\mathbf{x}_{t'}^{\top}, \tag{17.15}
$$

where $\gamma_c := 1/\tau_c$. Hence,

$$
\mathbf{E}(t) = (1 - \gamma_c)\mathbf{E}(t-1) + \gamma_c\, \mathbf{x}_t \mathbf{x}_t^{\top}. \tag{17.16}
$$

(a) The second term on the right hand side can be thought of as a Wishart matrix
with $T = 1$ (or $q = N$). Now, both $\mathbf{E}(t)$ and $\mathbf{E}(t-1)$ are equal in law so we
write

$$
\mathbf{E} \overset{\text{in law}}{=} (1 - \gamma_c)\mathbf{E} + \gamma_c \mathbf{W}_{q=N}. \tag{17.17}
$$

Given that $\mathbf{E}$ and $\mathbf{W}$ are free, use the properties of the R-transform to get the
equation

$$
R_{\mathbf{E}}(x) = (1 - \gamma_c)\,R_{\mathbf{E}}\big((1 - \gamma_c)x\big) + \frac{\gamma_c}{1 - N\gamma_c x}. \tag{17.18}
$$

(b) Take the limit $N \to \infty$, $\tau_c \to \infty$ with $q := N/\tau_c$ fixed to get the following
differential equation for $R_{\mathbf{E}}(x)$:

$$
R_{\mathbf{E}}(x) = -x\,\frac{\mathrm{d}}{\mathrm{d}x}R_{\mathbf{E}}(x) + \frac{1}{1 - qx}. \tag{17.19}
$$

(c) The definition of $\mathbf{E}$ is properly normalized, $\tau(\mathbf{E}) = 1$ [show this using
Eq. (17.17)], so we have the initial condition $R_{\mathbf{E}}(0) = 1$. Show that

$$
R_{\mathbf{E}}(x) = -\frac{\log(1 - qx)}{qx} \tag{17.20}
$$

solves your equation with the correct initial condition. Compute the variance
$\kappa_2(\mathbf{E})$.

(d) To compute the spectrum of eigenvalues of $\mathbf{E}$, one needs to solve a complex
transcendental equation. First write $\mathfrak{z}(g)$, the inverse of $g(z)$. For $q = 1/2$ plot
$\mathfrak{z}$ as a function of $g$ (for $-4 < g < 2$). You will see that there are values of
$z$ that are never attained by $\mathfrak{z}(g)$, in other words $g(z)$ has no real solutions for
these $z$. Numerically find complex solutions for $g(z)$ in that range. Plot the
density of eigenvalues $\rho_{\mathbf{E}}(\lambda)$ given by Eq. (2.47). Plot also the density for a
Wishart with the same mean and variance.

(e) Construct numerically the matrix $\mathbf{E}$ as in Eq. (17.15). Use $N = 1000$, $\tau_c = 2000$
and use at least 10 000 values for $t'$. Plot the eigenvalue distribution of
your numerical $\mathbf{E}$ against the distribution found in (d).
17.2 Temporal Correlations
17.2.1 General Case 


A common problem in data analysis arises when samples are not independent. Intuitively,
correlated samples are somehow redundant and the sample covariance matrix should
behave as if we had observed not $T$ samples but an effective number $T^* < T$. Let us
analyze more precisely the sample covariance matrix in the presence of correlated samples.
We will start with the case when the true spatial correlations are zero, i.e. $\mathbf{C} = \mathbf{1}$. Our data
can then be written in a rectangular $N \times T$ matrix $\mathbf{H}$ satisfying

$$
\mathbb{E}\left[H_{it} H_{js}\right] = \delta_{ij} K_{ts}, \tag{17.21}
$$

where $\mathbf{K}$ is the $T \times T$ temporal covariance matrix that we assume to be normalized as
$\tau(\mathbf{K}) = 1$. Following the same arguments as in Section 4.2.4, we can write

$$
\mathbf{H} = \mathbf{H}_0 \mathbf{K}^{1/2}, \tag{17.22}
$$

where $\mathbf{H}_0$ is a white rectangular matrix. So the sample covariance matrix becomes

$$
\mathbf{E} = \frac{1}{T}\mathbf{H}\mathbf{H}^{\top} = \frac{1}{T}\mathbf{H}_0 \mathbf{K} \mathbf{H}_0^{\top}. \tag{17.23}
$$

Now this is not quite the free product of the matrix $\mathbf{K}$ and a white Wishart, but if we define
the ($T \times T$) matrix $\mathbf{F}$ as

$$
\mathbf{F} = \frac{1}{N}\mathbf{H}^{\top}\mathbf{H} = \frac{1}{N}\mathbf{K}^{1/2}\mathbf{H}_0^{\top}\mathbf{H}_0\mathbf{K}^{1/2} = \mathbf{K}^{1/2}\,\mathbf{W}_{1/q}\,\mathbf{K}^{1/2}, \tag{17.24}
$$

then $\mathbf{F}$ is the free product of the matrix $\mathbf{K}$ and a white Wishart matrix with parameter $1/q$.
Hence,

$$
S_{\mathbf{F}}(t) = \frac{S_{\mathbf{K}}(t)}{1 + t/q}. \tag{17.25}
$$


To find the S-transform of $\mathbf{E}$, we go back to Section 4.1.1, where we obtained Eq. (4.5)
relating the Stieltjes transforms of $\mathbf{E}$ and $\mathbf{F}$. In terms of the T-transform, the relation is even
simpler:

$$
t_{\mathbf{F}}(z) = q\,t_{\mathbf{E}}(qz) \quad\Rightarrow\quad \zeta_{\mathbf{E}}(t) = q\,\zeta_{\mathbf{F}}(qt), \tag{17.26}
$$

where the functions $\zeta(t)$ are the inverse T-transforms. Using the definition of the
S-transform (Eq. (11.92)), we finally get

$$
S_{\mathbf{E}}(t) = \frac{S_{\mathbf{K}}(qt)}{1 + qt}, \tag{17.27}
$$

which can be expressed as a relation between inverse T-transforms:

$$
\zeta_{\mathbf{E}}(t) = q(1 + t)\,\zeta_{\mathbf{K}}(qt). \tag{17.28}
$$
We can also write a subordination relation between the T-transforms:

$$
q\,t_{\mathbf{E}}(z) = t_{\mathbf{K}}\!\left(\frac{z}{q\left(1 + t_{\mathbf{E}}(z)\right)}\right). \tag{17.29}
$$

This is a general formula that we specialize to the case of exponential temporal correlations
in the next section. Note that in the limit $z \to 0$, the above equation gives access to $\tau(\mathbf{E}^{-1})$.
Using

$$
t_{\mathbf{E}}(z) \underset{z \to 0}{=} -1 - \tau(\mathbf{E}^{-1})\,z + O(z^2), \tag{17.30}
$$

we find

$$
\tau(\mathbf{E}^{-1}) = -\frac{1}{q\,\zeta_{\mathbf{K}}(-q)}. \tag{17.31}
$$


17.2.2 Exponential Correlations 


The most common form of temporal correlation in experimental data is the decaying exponential,
corresponding to a matrix $K_{ts}$ in Eq. (17.21) given by

$$
K_{ts} := a^{|t-s|}, \tag{17.32}
$$

where $1/|\log a|$ defines the temporal span of the correlations.

In Appendix A.3 we explicitly compute the S-transform of $\mathbf{K}$. The result reads

$$
S_{\mathbf{K}}(t) = \frac{t+1}{\sqrt{1 + (b^2 - 1)t^2} + bt}, \tag{17.33}
$$

where $b := (1 + a^2)/(1 - a^2)$. From $S_{\mathbf{K}}$ one can also obtain $\zeta_{\mathbf{K}}$ and its inverse $t_{\mathbf{K}}$, which
read

$$
\zeta_{\mathbf{K}}(t) = \frac{\sqrt{1 + (b^2 - 1)t^2}}{t} + b, \qquad t_{\mathbf{K}}(\zeta) = \frac{1}{\sqrt{\zeta^2 - 2\zeta b + 1}}. \tag{17.34}
$$

Combining Eq. (17.27) with Eq. (17.33), we get

$$
S_{\mathbf{E}}(t) = \frac{1}{\sqrt{1 + (b^2 - 1)(qt)^2} + bqt}. \tag{17.35}
$$

From the S-transform, we find

$$
\zeta_{\mathbf{E}}(t) = \frac{1+t}{t\,S_{\mathbf{E}}(t)} = \frac{1+t}{t}\left(\sqrt{1 + (b^2 - 1)(qt)^2} + bqt\right), \tag{17.36}
$$

which when inverted leads to a fourth order equation for $t_{\mathbf{E}}(z)$ that must be solved numerically,
leading to the densities plotted in Fig. 17.1. However, one can obtain some information
on $\tau(\mathbf{E}^{-1})$. From Eqs. (17.31) and (17.34), one obtains


$$
\tau(\mathbf{E}^{-1}) = \frac{1}{\sqrt{q^2(b^2 - 1) + 1} - bq} = \frac{1}{1 - q^*}, \tag{17.37}
$$

where $q^* = N/T^*$ defines the effective length of the time series, reduced by temporal
correlations (compare with Eq. (17.14) with $\mathbf{C} = \mathbf{1}$). Figure 17.2 shows $q^*$ as a function
of $a$. As expected, $q^* = q$ for $a = 0$ (no temporal correlations), whereas $q^* \to 1$ when
$a \to 1$, i.e. when $\tau_c \to \infty$. In this limit, $\mathbf{E}$ becomes singular.

Figure 17.1 Density of eigenvalues for a sample covariance matrix with exponential temporal
correlations for three choices of parameters $q$ and $b$ such that $qb = 0.25$. All three densities are
normalized, have mean 1 and variance $\sigma_{\mathbf{E}}^2 = qb = 0.25$. The solid light gray one is the Marčenko–Pastur
density ($q = 0.25$), the dotted black one is very close to the limiting density for $q \to 0$ with
$\sigma^2 = bq$ fixed.

Figure 17.2 Effective value $q^*$ versus the one-lag autocorrelation coefficient $a$ for a sample
covariance matrix with exponential temporal correlations shown for two values of $q$. The dashed
lines indicate the approximation (valid at small $a$) $q^* = q(1 + 2a^2)$. The approximation means that,
for 10% autocorrelation, $q^*$ is only 2% greater than $q$.
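Eq. (17.37) gives the effective ratio $q^*$ in closed form, $q^* = 1 - \sqrt{q^2(b^2-1)+1} + bq$ with $b = (1+a^2)/(1-a^2)$. A minimal sketch of this formula follows; the function names are chosen here for illustration.

```python
import math


def effective_q(q, a):
    """Effective ratio q* = N/T* for exponential correlations K_ts = a**|t-s|.

    Obtained by solving Eq. (17.37) for q*, with b = (1 + a^2) / (1 - a^2).
    """
    if not 0.0 <= a < 1.0:
        raise ValueError("a must satisfy 0 <= a < 1")
    b = (1.0 + a * a) / (1.0 - a * a)
    return 1.0 - math.sqrt(q * q * (b * b - 1.0) + 1.0) + b * q


def tau_inverse(q, a):
    """Normalized trace of E^-1 for C = 1, i.e. 1 / (1 - q*)."""
    return 1.0 / (1.0 - effective_q(q, a))
```

Evaluating `effective_q(0.25, a)` for `a` between 0 and 1 reproduces the qualitative behavior described above: the value starts at $q$ for $a = 0$, grows slowly at small $a$, and approaches 1 as $a \to 1$.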
Looking at Eq. (17.35), one notices that when $b \gg 1$ (corresponding to $a \to 1$,
i.e. slowly decaying correlations), the S-transform depends on $b$ and $q$ only through the
combination $qb$. One can thus define a new limiting distribution corresponding to the limit
$q \to 0$, $b \to \infty$ with $qb = \sigma^2$ (which turns out to be the variance of the distribution, see
below). The S-transform of this limiting distribution is given by

$$
S(t) = \frac{1}{\sqrt{1 + \sigma^4 t^2} + \sigma^2 t}, \tag{17.38}
$$

while the equation for the T-transform boils down to a cubic equation that reads:

$$
z^2 t^2(z) - 2\sigma^2 z\, t^2(z)\left(1 + t(z)\right) = \left(1 + t(z)\right)^2. \tag{17.39}
$$

The corresponding R-transform is

$$
R(z) = \frac{1}{\sqrt{1 - 2\sigma^2 z}} = 1 + \sigma^2 z + \frac{3}{2}\sigma^4 z^2 + O(z^3). \tag{17.40}
$$

The last equation gives its first three cumulants: its average is equal to one, its variance
is $\sigma^2$ as announced above, and its skewness is $\kappa_3 = \frac{3}{2}\sigma^4$. We notice that this skewness is
larger than that of a white Wishart with the same variance ($q = \sigma^2$) for which $\kappa_3 = \sigma^4$.
The equations for the Stieltjes $g(z)$ and the T-transform are both cubic equations. The
corresponding distribution of eigenvalues is shown in Figure 17.3. Note that, unlike the
Marčenko–Pastur, there is always a strictly positive lower edge of the spectrum $\lambda_- > 0$
and no Dirac at zero even when $\sigma^2 > 1$. Unfortunately, the equation giving $\lambda_\pm$ is a fourth
order equation that does not have a concise solution.
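The consistency between the S-transform (17.38) and the R-transform (17.40) can be checked numerically. The sketch below assumes the standard relation between the two transforms, $S(t)\,R\big(tS(t)\big) = 1$, and uses function names invented for this example; `sigma2` stands for $\sigma^2$.

```python
import math


def s_limit(t, sigma2):
    """S-transform of the limiting distribution, Eq. (17.38)."""
    return 1.0 / (math.sqrt(1.0 + sigma2**2 * t**2) + sigma2 * t)


def r_limit(x, sigma2):
    """R-transform of the limiting distribution, Eq. (17.40)."""
    return 1.0 / math.sqrt(1.0 - 2.0 * sigma2 * x)


def r_series(x, sigma2):
    """Second-order expansion of R: cumulants 1, sigma^2 and (3/2) sigma^4."""
    return 1.0 + sigma2 * x + 1.5 * sigma2**2 * x**2


def sr_identity_defect(t, sigma2):
    """Deviation of S(t) * R(t S(t)) from 1 (assumed S/R relation)."""
    s = s_limit(t, sigma2)
    return s * r_limit(t * s, sigma2) - 1.0
```

The defect returned by `sr_identity_defect` is zero up to rounding for any `t > 0`, and `r_series` matches `r_limit` at small argument, which confirms the three cumulants quoted in the text.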


Figure 17.3 Density of eigenvalues for the limiting distribution of sample covariance matrix with
exponential temporal correlations $\mathbf{W}_{\sigma^2}$ for three choices of the parameter $\sigma^2$: 0.25, 0.5 and 1.
An intuitive way to understand this particular random matrix ensemble is to consider
$N$ independent Ornstein–Uhlenbeck processes with the same correlation time $\tau_c$ that we
record over a long time $T$. We sample the data at interval $\Delta$, such that the total number
of observations is $T/\Delta$. We then construct a sample covariance matrix of the $N$ variables
from these observations. If $\Delta \gg \tau_c$, then each sample can be considered independent
and the sample covariance matrix will be a Marčenko–Pastur with $q = N\Delta/T$. But if
we “oversample” at intervals $\Delta \ll \tau_c$, such that our observations are strongly correlated,
then the resulting sample covariance matrix no longer depends on $\Delta$ but only on $\tau_c$. The
sample covariance matrix converges in this case to our new random matrix characterized
by Eq. (17.38), with parameter $\sigma^2 = qb = N\tau_c/T$.


17.2.3 Spatial and Temporal Correlations 


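In the general case where spatial and temporal correlations exist, the sample covariance
matrix can be written as

$$
\mathbf{E} = \frac{1}{T}\mathbf{H}\mathbf{H}^{\top} = \frac{1}{T}\mathbf{C}^{1/2}\mathbf{H}_0\mathbf{K}\mathbf{H}_0^{\top}\mathbf{C}^{1/2}, \tag{17.41}
$$

using the same notations as above. After similar manipulations, the S-transform of $\mathbf{E}$ is
found to be given by

$$
S_{\mathbf{E}}(t) = \frac{S_{\mathbf{C}}(t)\,S_{\mathbf{K}}(qt)}{1 + qt}, \tag{17.42}
$$

which leads to

$$
\zeta_{\mathbf{E}}(t) = q\,t\,\zeta_{\mathbf{C}}(t)\,\zeta_{\mathbf{K}}(qt), \tag{17.43}
$$

or, in terms of T-transforms,

$$
q\,t_{\mathbf{E}}(z) = t_{\mathbf{K}}\!\left(\frac{z}{q\,t_{\mathbf{E}}(z)\,\zeta_{\mathbf{C}}\big(t_{\mathbf{E}}(z)\big)}\right). \tag{17.44}
$$

When $\mathbf{C} = \mathbf{1}$, $\zeta_{\mathbf{C}}(t) = (1 + t)/t$ and one recovers Eq. (17.29). Specializing to the case of
exponential correlations in the limit $q \to 0$, $a \to 1$, $qb = \sigma^2$, we obtain the following
equation for the T-transform of the limiting distribution, now for an arbitrary covariance
matrix $\mathbf{C}$:

$$
z^2 - 2\sigma^2 z\, t_{\mathbf{E}}(z)\,\zeta_{\mathbf{C}}\big(t_{\mathbf{E}}(z)\big) = \zeta_{\mathbf{C}}^2\big(t_{\mathbf{E}}(z)\big), \tag{17.45}
$$

where we used $t_{\mathbf{K}}(z) = -1/\sqrt{z^2 - 2zb + 1}$. When $\mathbf{C} = \mathbf{1}$, one recovers Eq. (17.39).
When $\mathbf{C}$ is an inverse-Wishart matrix, $\zeta_{\mathbf{C}}(t) = (t + 1)/\big(t(1 - pt)\big)$, the equation for $t_{\mathbf{E}}(z)$ is
of fourth order.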

Note finally that Eq. (17.44), in the limit $z \to 0$, yields a simple generalization of
Eq. (17.31) that reads

$$
\tau(\mathbf{E}^{-1}) = -\frac{\tau(\mathbf{C}^{-1})}{q\,\zeta_{\mathbf{K}}(-q)}. \tag{17.46}
$$

Comparing with Eq. (17.14) allows us to define an effective length of the time series which,
interestingly, is independent of $\mathbf{C}$ and reads

$$
q^* := \frac{N}{T^*} = 1 + q\,\zeta_{\mathbf{K}}(-q). \tag{17.47}
$$


Exercise 17.2.1 On the futility of oversampling
Consider data consisting of $N$ variables (rows) with true correlation $\mathbf{C}$ and
$T$ independent observations (columns). Instead of computing the sample covariance
matrix with these $T$ observations, we repeat each one $m$ times and sum over
$mT$ columns. Obviously the redundant columns should not change the sample
covariance matrix, hence it should have the same spectrum as the one using only
the original $T$ observations.

(a) The redundancy of columns can be modeled as a temporal correlation with
an $mT \times mT$ covariance matrix $\mathbf{K}$ that is block diagonal with $T$ blocks of
size $m$ and all the values within one block equal to 1 and zero outside the
blocks. Show that this matrix has $T$ eigenvalues equal to $m$ and $T(m - 1)$ zero
eigenvalues.

(b) Compute $t_{\mathbf{K}}(z)$ for this model.

(c) Show that $S_{\mathbf{K}}(t) = (1 + t)/(1 + mt)$.

(d) If we include the redundant columns we have a value of $q_m = N/(mT)$, but
we need to take temporal correlations into account so $S_{\mathbf{E}}(t) = S_{\mathbf{C}}(t)\,S_{\mathbf{K}}(q_m t)/(1 + q_m t)$.
Show that in this case $S_{\mathbf{E}}(t) = S_{\mathbf{C}}(t)/(1 + qt)$ with $q = N/T$,
which is the result without the redundant columns.


17.3 Time Dependent Variance 


Another common and important situation is when the $N$ correlated time series are heteroskedastic,
i.e. have a time dependent variance. More precisely, we consider a model
where

$$
x_i^t = \sigma_t H_{it}, \tag{17.48}
$$

where $\sigma_t$ is time dependent, and

$$
\mathbb{E}\left[H_{it} H_{js}\right] = \delta_{ts} C_{ij}, \tag{17.49}
$$

i.e. $x_i^t$ is the product of a time dependent factor $\sigma_t$ and a random variable with a general
correlation structure $\mathbf{C}$ but no time correlations. The SCM $\mathbf{E}$ can be expressed as

$$
\mathbf{E} = \sum_{t=1}^{T} \mathbf{P}_t, \qquad \mathbf{P}_t := \frac{1}{T}\,\mathbf{x}^t (\mathbf{x}^t)^{\top}, \tag{17.50}
$$

where each $\mathbf{P}_t$ is a rank-1 matrix with a non-zero eigenvalue that converges, when $N$ and
$T$ tend to infinity, to $q\sigma_t^2\,\tau(\mathbf{C})$ with, as always, $q = N/T$.

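We will first consider the case $\mathbf{C} = \mathbf{1}$, i.e. a structureless covariance matrix. In this case,
the vectors $\mathbf{x}^t$ are rotationally invariant, and the matrix $\mathbf{E}$ can be viewed as the free sum of a
large number of rank-1 matrices, each with a non-zero eigenvalue equal to $q\sigma_t^2$. Hence,

$$
R_{\mathbf{E}}(g) = \sum_{t=1}^{T} R_t(g). \tag{17.51}
$$

To compute the R-transform of the matrix $\mathbf{E}$ we need to compute the R-transform of a
rank-1 matrix. Note that since there are $T$ terms in the sum, we will need to know $R_t(g)$
including corrections of order $1/N$:

$$
g_t(z) = \frac{1}{N}\left(\frac{N - 1}{z} + \frac{1}{z - q\sigma_t^2}\right) = \frac{1}{z} + \frac{1}{N}\,\frac{q\sigma_t^2}{z\,(z - q\sigma_t^2)}. \tag{17.52}
$$

Inverting to first order in $1/N$ we find

$$
\mathfrak{z}_t(g) = \frac{1}{g} + \frac{1}{N}\,\frac{q\sigma_t^2}{1 - q\sigma_t^2 g}. \tag{17.53}
$$

Now, since $R(g) = \mathfrak{z}(g) - 1/g$, we find

$$
R_{\mathbf{E}}(g) = \frac{1}{T}\sum_{t=1}^{T}\frac{\sigma_t^2}{1 - q\sigma_t^2 g}. \tag{17.54}
$$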


The fluctuations of $\sigma_t^2$ can be stochastic or deterministic. In the large $T$ limit we can encode
them with a probability density $P(s)$ for $s = \sigma^2$ and convert the sum into an integral,
leading to¹

$$
R_{\mathbf{E}}(g) = \int_0^{\infty} \frac{s\,P(s)}{1 - q s g}\,\mathrm{d}s. \tag{17.55}
$$

Note that if the variance is always 1 (i.e. $P(s) = \delta(s - 1)$), we recover the R-transform of
a Wishart matrix of parameter $q$:

$$
R_q(g) = \frac{1}{1 - qg}. \tag{17.56}
$$

In the general case, the R-transform of $\mathbf{E}$ is simply related to the T-transform of the distribution
of $s$, $t_s(z) = \int \frac{s\,P(s)}{z - s}\,\mathrm{d}s$:

$$
R_{\mathbf{E}}(g) = \frac{1}{qg}\,t_s\!\left(\frac{1}{qg}\right). \tag{17.57}
$$


1 When the distribution of s is bounded, the integral (17.55) always converges for small enough g and the R-transform is well 
defined near zero. For unbounded s, the R-transform can be singular at zero indicating that the distribution of eigenvalues 
doesn’t have an upper edge. 
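For a discrete variance distribution the integral (17.55) becomes a finite sum, which makes the link with the T-transform of $s$ easy to verify. The sketch below is an illustration under that assumption (a finite list of variance levels with weights); the function names are hypothetical, and the T-transform is taken as $t_s(z) = \sum_s w_s\, s/(z - s)$, for which one has $q g\,R_{\mathbf{E}}(g) = t_s\big(1/(qg)\big)$.

```python
def r_transform_hetero(g, q, variances, weights=None):
    """R_E(g) of Eq. (17.55) for a discrete distribution of the variance s.

    variances: list of values of s; weights: their probabilities
    (uniform weights are assumed when omitted).
    """
    if weights is None:
        weights = [1.0 / len(variances)] * len(variances)
    return sum(w * s / (1.0 - q * s * g) for s, w in zip(variances, weights))


def t_transform_of_s(z, variances, weights=None):
    """T-transform t_s(z) = sum_s w_s * s / (z - s) of the same distribution."""
    if weights is None:
        weights = [1.0 / len(variances)] * len(variances)
    return sum(w * s / (z - s) for s, w in zip(variances, weights))
```

With a single level `variances=[1.0]`, `r_transform_hetero` reduces to the Wishart result $1/(1 - qg)$ of Eq. (17.56).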
In the more general case where $\mathbf{C}$ is not the identity matrix, one can again write the SCM as
$\mathbf{E} = \mathbf{C}^{1/2}\,\tilde{\mathbf{E}}\,\mathbf{C}^{1/2}$, where $\tilde{\mathbf{E}}$ corresponds to the case $\mathbf{C} = \mathbf{1}$ that we just treated. Hence, using
the fact that $\mathbf{C}$ and $\tilde{\mathbf{E}}$ are mutually free, the S-transform of $\mathbf{E}$ is simply given by

$$
S_{\mathbf{E}}(t) = S_{\mathbf{C}}(t)\,S_{\tilde{\mathbf{E}}}(t). \tag{17.58}
$$

Another way to treat the problem is to view the fluctuating variance as a diagonal temporal
covariance matrix with entries drawn from $P(s)$. Following Section 17.2.3, we can write

$$
S_{\tilde{\mathbf{E}}}(t) = \frac{S_s(qt)}{1 + qt} \quad\Rightarrow\quad S_{\mathbf{E}}(t) = \frac{S_{\mathbf{C}}(t)\,S_s(qt)}{1 + qt}, \tag{17.59}
$$

with $S_s(t)$ the S-transform associated with $t_s(\zeta)$.

A particular case of interest for financial applications is when $P(s)$ is an inverse-gamma
distribution. When $\mathbf{x}^t$ is a Gaussian multivariate vector, one obtains for $\sigma_t \mathbf{x}^t$ a Student
multivariate distribution (see bibliographical notes for more on this topic).


17.4 Empirical Cross-Covariance Matrices 


Let us now consider two time series $\mathbf{x}^t$ and $\mathbf{y}^t$, each of length $T$, but of different
dimensions, respectively $N_1$ and $N_2$. The empirical cross-covariance matrix is an $N_1 \times N_2$
rectangular matrix defined as

$$
\mathbf{E}_{xy} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{x}^t (\mathbf{y}^t)^{\top}. \tag{17.60}
$$

Let us assume that the “true” cross-covariance matrix $\mathbb{E}[\mathbf{x}\mathbf{y}^{\top}]$ is zero, i.e. that there are
no true cross-correlations between our two sets of variables. What is the singular value
spectrum of $\mathbf{E}_{xy}$ in this case?

As with SCM that are described by the Marčenko–Pastur law when $N, T \to \infty$ with
a fixed ratio $q = N/T$, we expect that some non-trivial results will appear in the limit
$N_1, N_2, T \to \infty$ with $q_1 = N_1/T$ and $q_2 = N_2/T$ finite. A convenient way to perform
this analysis is to consider the eigenvalues of the $N_1 \times N_1$ matrix $\mathbf{M}_{xy} = \mathbf{E}_{xy}\mathbf{E}_{xy}^{\top}$, which
are equal to the square of the singular values $s$ of $\mathbf{E}_{xy}$.

The matrix $\mathbf{M}_{xy}$ shares the same non-zero eigenvalues as those of $\hat{\mathbf{E}}_x \hat{\mathbf{E}}_y$, where $\hat{\mathbf{E}}_x$
and $\hat{\mathbf{E}}_y$ are the dual $T \times T$ sample covariance matrices:

$$
\hat{\mathbf{E}}_x = \mathbf{x}^{\top}\mathbf{x}, \qquad \hat{\mathbf{E}}_y = \mathbf{y}^{\top}\mathbf{y}. \tag{17.61}
$$

Hence one can compute the spectral density of $\mathbf{M}_{xy}$ using the free product formalism and
infer the spectrum of the product $\hat{\mathbf{E}}_x \hat{\mathbf{E}}_y$. However, the result will depend on the “true”
covariance matrices of $\mathbf{x}$ and $\mathbf{y}$, which are usually unknown in practical applications.

A way to obtain a universal result is to consider the sample-normalized principal
components of $\mathbf{x}$ and of $\mathbf{y}$, which we call $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$, such that the corresponding dual
covariance matrix $\hat{\mathbf{E}}_{\hat{x}}$ has $N_1$ eigenvalues exactly equal to 1 and $T - N_1$ eigenvalues
exactly equal to zero, whereas $\hat{\mathbf{E}}_{\hat{y}}$ has $N_2$ eigenvalues exactly equal to 1 and $T - N_2$
eigenvalues exactly equal to zero. This is precisely the problem studied in Section 15.4.2.
The singular value spectrum of $\mathbf{E}_{\hat{x}\hat{y}}$ is thus given by

$$
\rho(s) = \max(q_1 + q_2 - 1, 0)\,\delta(s - 1) + \mathrm{Re}\,\frac{\sqrt{(s^2 - \gamma_-)(\gamma_+ - s^2)}}{\pi s\,(1 - s^2)}, \tag{17.62}
$$
where $\gamma_\pm$ are given by

$$
\gamma_\pm = q_1 + q_2 - 2q_1 q_2 \pm 2\sqrt{q_1 q_2 (1 - q_1)(1 - q_2)}, \qquad 0 \le \gamma_\pm \le 1. \tag{17.63}
$$

The allowed $s$'s are all between 0 and 1, as they should be, since these singular values can
be interpreted as correlation coefficients between some linear combination of the $x$'s and
some other linear combination of the $y$'s.

In the limit $T \to \infty$ at fixed $N_1$, $N_2$, all singular values collapse to zero, as they
should since there are no true correlations between $\mathbf{x}$ and $\mathbf{y}$. The allowed band in the limit
$q_1, q_2 \to 0$ becomes

$$
s \in \left[\,\left|\sqrt{q_1} - \sqrt{q_2}\right|,\ \sqrt{q_1} + \sqrt{q_2}\,\right],
$$

showing that for fixed $N_1$, $N_2$, the order of magnitude of allowed singular values decays
as $T^{-1/2}$. The above result allows one to devise precise statistical tests to detect “true”
cross-correlations between sets of variables.
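As a practical aid for such tests, the band of allowed singular values follows directly from Eq. (17.63). The sketch below uses illustrative function names and assumes $0 < q_1, q_2 < 1$; a singular value of the empirical cross-covariance matrix well above the upper edge would then hint at a genuine cross-correlation.

```python
import math


def gamma_edges(q1, q2):
    """Edges gamma_minus and gamma_plus of Eq. (17.63), for 0 < q1, q2 < 1."""
    base = q1 + q2 - 2.0 * q1 * q2
    root = math.sqrt(q1 * q2 * (1.0 - q1) * (1.0 - q2))
    # Clip to [0, 1] to guard against rounding just outside the allowed range.
    return max(base - 2.0 * root, 0.0), min(base + 2.0 * root, 1.0)


def singular_value_band(q1, q2):
    """Band [s_minus, s_plus] of the continuous part of the singular value spectrum."""
    gamma_minus, gamma_plus = gamma_edges(q1, q2)
    return math.sqrt(gamma_minus), math.sqrt(gamma_plus)
```

For small `q1` and `q2` the band returned by `singular_value_band` approaches $[\,|\sqrt{q_1} - \sqrt{q_2}|,\ \sqrt{q_1} + \sqrt{q_2}\,]$, as stated above.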
Bayesian Estimation


In this chapter we will review the subject of Bayesian estimation, with a particular focus on
matrix estimation. The general situation one encounters is one where the observed matrix
is a noisy version of the “true” matrix one wants to estimate. For example, in the case of
additive noise, one observes a matrix $\mathbf{E}$ which is the true matrix $\mathbf{C}$ plus a random matrix $\mathbf{X}$
that plays the role of noise, to wit,

$$
\mathbf{E} = \mathbf{C} + \mathbf{X}. \tag{18.1}
$$

In the case of multiplicative noise, the observed matrix $\mathbf{E}$ has the form

$$
\mathbf{E} = \mathbf{C}^{1/2}\,\mathbf{W}\,\mathbf{C}^{1/2}. \tag{18.2}
$$

When $\mathbf{W}$ is a white Wishart matrix, this is the problem of sample covariance matrix encountered
in Chapter 17.

In general, the true matrix C is unknown to us. We would like to know the probability 
of C given that we have observed E, i.e. compute P(C|E). This is the general subject of 
Bayesian estimation, which we introduce and discuss in this chapter. 


18.1 Bayesian Estimation 


Before doing Bayesian theory on random matrices (see Section 18.3), we first review 
Bayesian estimation and see it at work on simpler examples. 


18.1.1 General Framework 


Imagine we observe a variable $y$, from which we would like to infer a related, unobserved
variable $x$. The variables $x$ and $y$ can be scalars, vectors, matrices, higher
dimensional objects ... We postulate that we know the random process that generates $y$
given $x$, i.e. $y$ could be a noisy version of $x$ or more generally $y$ could be drawn from
a known distribution with $x$ as a parameter. The generation process of $y$ is encoded in a
probability distribution $P(y|x)$, which is called the sampling distribution or the likelihood
function.
Given our knowledge of $P(y|x)$, we would like to write the inference probability
$P(x|y)$, also called the posterior distribution. To do so, we can use Bayes’ rule:

$$
P(x|y) = \frac{P(y|x)\,P_0(x)}{P(y)}. \tag{18.3}
$$

To obtain the desired probability, Bayes’ rule tells us that we need to know the prior
distribution $P_0(x)$. In theory $P_0(x)$ is the distribution from which $x$ is drawn and it is
in some cases knowable. In many practical applications, however, $x$ is actually not random
but simply unknown and $P_0(x)$ encodes our ignorance of $x$. It should represent our best
(probabilistic) guess of $x$ before we observe the data $y$. The determination (or arbitrariness)
of the prior $P_0(x)$ is considered to be one of the weak points of the Bayesian approach.
Often $P_0(x)$ is just taken to be constant, i.e. no prior knowledge at all on $x$. However, note
that $P_0(x) = \text{constant}$ is not invariant upon changes of variables, for if $x' = f(x)$ is a non-linear
transformation of $x$, then $P_0(x')$ is no longer constant! In Section 18.1.3, we will see
how the arbitrariness in the choice of $P_0(x)$ can be used to simplify modeling.

The other distribution appearing in Bayes’ rule, $P(y)$, is actually just a normalization
factor. Indeed, $y$ is assumed to be known, therefore $P(y)$ is just a fixed number that can be
computed by normalizing the posterior distribution. One therefore often simplifies Bayes’
rule as

$$
P(x|y) = \frac{1}{Z}\,P(y|x)\,P_0(x), \qquad Z := \int \mathrm{d}x\, P(y|x)\,P_0(x), \tag{18.4}
$$

where $P(y|x)$ represents the measurement (or noise) process and $P_0(x)$ the (often arbitrary)
prior distribution.

From the posterior distribution $P(x|y)$ we can build an estimator of $x$. The optimal estimator
depends on the problem at hand, namely, which quantity we are trying to optimize.
The most common Bayesian estimators are

1 MMSE: The posterior mean $\mathbb{E}[x]_y$. It minimizes a quadratic loss function and is hence
called the Minimum Mean Square Error estimator.

2 MAVE: The posterior median or Minimum Absolute Value Error estimator.

3 MAP: The Maximum A Posteriori estimator, defined as $\hat{x} = \operatorname{argmax}_x P(x|y)$.
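The three estimators listed above can be illustrated on a discretized posterior. The following is a minimal sketch, not a production routine: it assumes a one-dimensional $x$ on a user-supplied grid, takes the likelihood and prior as plain Python callables (unnormalized densities are fine), and normalizes by the sum over the grid in the spirit of Eq. (18.4). All names are invented for this example.

```python
import math


def posterior_on_grid(y, likelihood, prior, grid):
    """Discretized posterior P(x|y) proportional to P(y|x) P0(x) on the grid."""
    weights = [likelihood(y, x) * prior(x) for x in grid]
    z = sum(weights)  # plays the role of the normalization Z
    return [w / z for w in weights]


def bayes_estimators(grid, probs):
    """Return (MMSE, MAVE, MAP) = (mean, median, mode) of a discretized posterior."""
    mean = sum(x * p for x, p in zip(grid, probs))
    cumulative = 0.0
    median = grid[-1]
    for x, p in zip(grid, probs):
        cumulative += p
        if cumulative >= 0.5:  # first grid point where half the mass is reached
            median = x
            break
    mode = grid[max(range(len(grid)), key=probs.__getitem__)]
    return mean, median, mode
```

With a Gaussian likelihood and a Gaussian prior the three outputs coincide up to the grid resolution, while for skewed or multimodal posteriors they generally differ.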


18.1.2 A Simple Estimation Problem 
Consider the simplest one-dimensional estimation problem:

$$
y = x + \varepsilon, \tag{18.5}
$$

where $x$ is some signal to be estimated, $\varepsilon$ is an independent noise, and $y$ is the observation.
Then $P(y|x)$ is simply $P_\varepsilon(\cdot)$ evaluated at $y - x$:

$$
P(y|x) = P_\varepsilon(y - x). \tag{18.6}
$$

Suppose further that $\varepsilon$ is a centered Gaussian noise with variance $\sigma_n^2$, where the subscript
n means “noise”. Then we have

$$
P(y|x) = \frac{1}{\sqrt{2\pi\sigma_n^2}}\exp\left(-\frac{(y - x)^2}{2\sigma_n^2}\right). \tag{18.7}
$$

Then we get that

$$
P(x|y) \propto P_0(x)\exp\left(\frac{2xy - x^2}{2\sigma_n^2}\right), \tag{18.8}
$$

where $P_0(x)$ is the prior distribution of $x$ and we have dropped $x$-independent factors.
Depending on the choice of $P_0(x)$ we will get different posterior distributions and hence
different estimators of $x$.


Gaussian Prior 
Suppose first $P_0(x)$ is a Gaussian with variance $\sigma_s^2$ (s for signal) centered at $x_0$. Then

$$
P(x|y) \propto \exp\left(-\frac{(x - x_0)^2}{2\sigma_s^2} + \frac{2xy - x^2}{2\sigma_n^2}\right) \propto \exp\left(-\frac{(x - \hat{x})^2}{2\sigma^2}\right), \tag{18.9}
$$

with

$$
\hat{x} := x_0 + r\,(y - x_0) = (1 - r)\,x_0 + r\,y; \qquad \sigma^2 := r\,\sigma_n^2, \tag{18.10}
$$

where the signal-to-noise ratio $r$ is $r = \sigma_s^2/(\sigma_s^2 + \sigma_n^2)$. The posterior distribution is thus a
Gaussian centered around $\hat{x}$ and of variance $\sigma^2$.

For a Gaussian distribution the mean, median and maximum probability values are all
equal to $\hat{x}$, which is therefore the optimal estimator in all three standard procedures, MMSE,
MAVE and MAP. This estimator is called the linear shrinkage estimator as it is linear in
the observed variable $y$. The linear coefficient of $y$ is the signal-to-noise ratio $r$, a number
smaller than one that shrinks the observed value towards the a priori mean $x_0$.

Note that this estimator can also be obtained in a completely different framework: it
is the affine estimator that minimizes the mean square error. The estimator is affine by
construction and minimization only involves first and second moments; it is therefore not
too surprising that we recover Eq. (18.10), see Exercise 18.1.1. As so often in optimization
problems, assuming Gaussian fluctuations is equivalent to imposing an affine solution.

Another important property of the linear shrinkage estimator is that it is rather conservative:
it is biased towards $x_0$. By assumption $x$ fluctuates with variance $\sigma_s^2$ and $y$ fluctuates
with variance $\sigma_s^2 + \sigma_n^2$. This allows us to compute the variance of the estimator $\hat{x}(y)$ as

$$
\mathbb{V}[\hat{x}(y)] = r^2\left(\sigma_s^2 + \sigma_n^2\right) = \frac{\sigma_s^4}{\sigma_s^2 + \sigma_n^2} \le \sigma_s^2. \tag{18.11}
$$
So the variance of the estimator¹ is not only smaller than that of the observed variable $y$, it
is also smaller than the fluctuations of the true variable $x$!
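The linear shrinkage estimator of Eq. (18.10) and the two variances discussed above fit in a few lines. This is a minimal sketch with illustrative names; `sigma_s2` and `sigma_n2` stand for the signal and noise variances $\sigma_s^2$ and $\sigma_n^2$.

```python
def linear_shrinkage(y, x0, sigma_s2, sigma_n2):
    """Linear shrinkage estimate for y = x + noise with a Gaussian prior on x.

    Returns (x_hat, posterior_variance, estimator_variance), where
    x_hat = x0 + r (y - x0) with r = sigma_s2 / (sigma_s2 + sigma_n2),
    posterior_variance = r * sigma_n2 and estimator_variance = r * sigma_s2.
    """
    r = sigma_s2 / (sigma_s2 + sigma_n2)
    x_hat = x0 + r * (y - x0)
    posterior_variance = r * sigma_n2
    estimator_variance = r * sigma_s2
    return x_hat, posterior_variance, estimator_variance
```

For equal signal and noise variances, $r = 1/2$ and the observation is pulled halfway towards the prior mean: `linear_shrinkage(2.0, 0.0, 1.0, 1.0)` returns `(1.0, 0.5, 0.5)`.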


Exercise 18.1.1 Optimal affine estimator
Suppose that we observe a variable $y$ that has some non-zero covariance with
an unknown variable $x$ that we would like to estimate. We will show that the
best affine estimator of $x$ is given by the linear shrinkage estimator (18.10). The
variables $x$ and $y$ can be drawn from any distribution with finite variance. We
write the general affine estimator

$$
\hat{x} = ay + b, \tag{18.12}
$$

and choose $a$ and $b$ to minimize the expected mean square error.

(a) Initially assume that $x$ and $y$ have zero mean; we will relax this assumption
later. Show that

$$
\mathbb{E}\left[(x - \hat{x})^2\right] = a^2\sigma_y^2 + b^2 + \sigma_x^2 - 2a\,\sigma_{xy}^2, \tag{18.13}
$$

where $\sigma_x^2$, $\sigma_y^2$ and $\sigma_{xy}^2$ are the variances of $x$, $y$ and their covariance.
(b) Show that the optimal estimator has $a = \sigma_{xy}^2/\sigma_y^2$ and $b = 0$.
(c) Compute $b$ in the non-zero mean case by considering $x - x_0$ estimated using
$y - y_0$.
(d) Compute $\sigma_{xy}^2$ and $\sigma_y^2$ when $y = x + \varepsilon$ with $\varepsilon$ independent of $x$.
(e) Show that when $\mathbb{E}[\varepsilon] = 0$ we recover Eq. (18.10).


Bernoulli Prior 


When $P_0(x)$ is non-Gaussian, the obtained estimators are in general non-linear. As a second
example suppose that $x$ is a Bernoulli random variable with $P_0(x = 1) = P_0(x = -1) = 1/2$.
Then, after a few simple manipulations one obtains

$$P(x|y) = \frac{1}{2}\left(1 + \tanh\left(\frac{y}{\sigma_n^2}\right)\right)\delta(x - 1) + \frac{1}{2}\left(1 - \tanh\left(\frac{y}{\sigma_n^2}\right)\right)\delta(x + 1). \tag{18.14}$$

The posterior distribution is now a discrete function that takes on only two values, namely
$\pm 1$. In this case the maximum probability and the median are such that

$$\hat{x}_{\text{MAP}}(y) = \hat{x}_{\text{MAVE}}(y) = \operatorname{sign}(y). \tag{18.15}$$


1 One should not confuse the variance of the posterior distribution $r\sigma_n^2$ with the variance of the estimator $r\sigma_s^2$. The first one
measures the remaining uncertainty about $x$ once we have observed $y$ while the second measures the variability of $\hat{x}(y)$ when
we repeat the experiment multiple times with varying $x$ and noise $\varepsilon$.




It is also easy to calculate the MMSE estimator:

$$\hat{x}_{\text{MMSE}}(y) = \mathbb{E}[x]_y = \tanh\left(\frac{y}{\sigma_n^2}\right). \tag{18.16}$$

It may seem odd that the MMSE estimator takes continuous values between $-1$ and $1$ while
we postulated that the true $x$ can only be equal to $\pm 1$. Nevertheless, in order to minimize the
variance it is optimal to shoot somewhere in the middle of $-1$ and $1$ as choosing the wrong
sign costs a lot in terms of variance. The estimator $\hat{x}_{\text{MMSE}}(y)$ is biased, i.e.
$\mathbb{E}[\hat{x}_{\text{MMSE}}|x] \neq x$. It also has a variance strictly less than 1, whereas the variance
of the true $x$ is unity.


Laplace Prior 


As a third example, consider a Laplace distribution

$$P_0(x) = \frac{b}{2} e^{-b|x|} \tag{18.17}$$

for the prior, with variance $2b^{-2}$. In this case the posterior distribution is given by

$$P(x|y) \propto \exp\left(-b|x| + \frac{2xy - x^2}{2\sigma_n^2}\right). \tag{18.18}$$

The MMSE and MAVE estimators can be computed but the results are not very enlightening
as they are given by an ugly combination of error functions and even inverse error functions
(for MAVE). The MAP estimator is both simpler and more interesting in this case. It is
given by

$$\hat{x}_{\text{MAP}}(y) = \begin{cases} 0 & \text{for } |y| < b\sigma_n^2, \\ y - b\sigma_n^2 \operatorname{sign}(y) & \text{otherwise.} \end{cases} \tag{18.19}$$

The MAP estimator is sparse in the sense that in a non-zero fraction of cases it takes the
exact value of zero. Note that the true variable $x$ itself is not sparse: it is almost surely
non-zero. This example is a toy model for the “LASSO” regularization that we will study in
Section 18.2.2.
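
The MAP rule of Eq. (18.19) is a soft-thresholding operation, which can be sketched in a few lines. The helper names `map_laplace` and `log_posterior` and the test values below are hypothetical; the grid search is only a brute-force sanity check of the closed form against the log-posterior of Eq. (18.18).

```python
import numpy as np

def map_laplace(y, b, sigma_n2):
    """MAP estimate of Eq. (18.19): soft-thresholding of y at b * sigma_n2."""
    threshold = b * sigma_n2
    return np.sign(y) * np.maximum(np.abs(y) - threshold, 0.0)

def log_posterior(x, y, b, sigma_n2):
    """Logarithm of Eq. (18.18), up to an additive constant."""
    return -b * np.abs(x) + (2.0 * x * y - x**2) / (2.0 * sigma_n2)

y = np.array([-3.0, -0.5, 0.0, 0.4, 2.0])
x_map = map_laplace(y, b=1.0, sigma_n2=1.0)      # small |y| are mapped to exactly zero

# Brute-force check for a single observation y = 2
x_grid = np.linspace(-5.0, 5.0, 10001)
x_map_grid = x_grid[np.argmax(log_posterior(x_grid, 2.0, 1.0, 1.0))]
print(x_map, x_map_grid)
```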


Non-Gaussian Noise 


The noise in Eq. (18.5) can also be non-Gaussian. When the noise has fat tails, one can even
be in the counter-intuitive situation where the estimator is not monotonic in the observed
variable, i.e. the best estimate of $x$ decreases as a function of its noisy version $y$. For
example, if $x$ is centered unit Gaussian and $\varepsilon$ is a centered unit Cauchy noise, we have

$$P(x|y) \propto \frac{e^{-x^2/2}}{(y - x)^2 + 1}. \tag{18.20}$$

Whereas the Cauchy noise $\varepsilon$ and the observation $y$ do not have a first moment, the
posterior distribution of $x$ is regularized by the Gaussian weight and all its moments
are finite. After some tedious calculation we arrive at the conditional mean or MMSE
estimator:

$$\mathbb{E}[x]_y = y + \frac{\operatorname{Im}\Phi}{\operatorname{Re}\Phi}, \quad \text{where } \Phi = e^{iy} \operatorname{erfc}\left(\frac{1 + iy}{\sqrt{2}}\right). \tag{18.21}$$

Figure 18.1 A non-monotonic optimal estimator. The MMSE estimator $\mathbb{E}[x]_y$ of a Gaussian variable
corrupted by Cauchy noise, as a function of the observation $y$ (see Eq. (18.21)). For small absolute
observations $y$, the estimator is almost linear with slope
$2 - \sqrt{2/(e\pi)}/\operatorname{erfc}(1/\sqrt{2}) \approx 0.475$ (dashed line).

The shape of the estimator as a function of $y$ is not obvious from this expression but it is
plotted numerically in Figure 18.1. The interpretation is the following:


• When we observe a small (order 1) value of $y$, we can assume that it was generated by a
moderate $x$ with moderate noise, hence we are in the regime of the linear estimator with
a signal-to-noise ratio close to one-half ($\hat{x} \approx 0.475y$).

• On the other hand, when $y$ is much larger than the standard deviation of $x$ it becomes
clear that $y$ can only be large because the noise takes extreme values. When the noise is
large our knowledge of $x$ decreases, hence the estimator tends to zero as $|y| \to \infty$.
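
Both regimes can be seen without the closed form, by integrating the posterior weight of a unit Gaussian prior with unit Cauchy noise on a grid. This is a minimal numerical sketch: the function name `mmse_cauchy_noise`, the grid parameters and the range of $y$ are assumptions made for illustration.

```python
import numpy as np

def mmse_cauchy_noise(y, half_width=12.0, n_grid=48001):
    """Posterior mean of x for a unit Gaussian prior and unit Cauchy noise."""
    x = np.linspace(-half_width, half_width, n_grid)
    # Unnormalized posterior: Gaussian prior weight times Cauchy likelihood
    weight = np.exp(-0.5 * x**2) / ((y - x) ** 2 + 1.0)
    return np.sum(x * weight) / np.sum(weight)

ys = np.linspace(0.0, 10.0, 101)
estimates = np.array([mmse_cauchy_noise(y) for y in ys])

slope_at_origin = mmse_cauchy_noise(1e-3) / 1e-3   # close to 0.475 for small y
y_at_maximum = ys[np.argmax(estimates)]            # beyond this point the estimate decreases
print(slope_at_origin, y_at_maximum, estimates[-1])
```

The estimate grows almost linearly near the origin, reaches a maximum at a moderate value of $y$, and then decays towards zero.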


18.1.3 Conjugate Priors 


The main weakness of Bayesian estimation is the reliance on a prior distribution for the
variable we want to estimate. In many practical applications one does not have a probabilistic
or statistical knowledge of $P_0(x)$. The variable $x$ is a fixed quantity that we do
not know, so how are we supposed to know about $P_0(x)$? In such cases we are left with
making a reasonable practical guess. Since $P_0(x)$ is just a guess, we can at least choose a
functional form for $P_0(x)$ that makes computation easy. This is the idea behind “conjugate
priors”.




Figure 18.2 The inverse-gamma distribution Eq. (18.24). Its mean is given by b/(a — 1) and it 
becomes increasingly peaked around this mean as both a and b become large. 


When we studied the one-dimensional estimation of a variable $x$ corrupted by additive
Gaussian noise (Eq. (18.5), with Gaussian $\varepsilon$) we found that choosing a Gaussian prior for
$x$ gave us a Gaussian posterior distribution. In many other cases, we can find a family of
prior distributions that will similarly keep the posterior distribution in the same family. This
concept is better explained in an example.

Imagine we are given a series of $T$ numbers $\{y_i\}$ generated independently from a centered
Gaussian distribution of variance $c$ that is unknown to us. We use the variable $c$ rather
than $\sigma^2$ to avoid the confusion between the estimation of $\sigma$ and that of $c = \sigma^2$. The joint
probability of the $y$'s is given by

$$P(\mathbf{y}|c) = \frac{1}{(2\pi c)^{T/2}} \exp\left(-\frac{\mathbf{y}^T\mathbf{y}}{2c}\right). \tag{18.22}$$

The posterior distribution is thus given by

$$P(c|\mathbf{y}) \propto P_0(c)\, c^{-T/2} \exp\left(-\frac{\mathbf{y}^T\mathbf{y}}{2c}\right). \tag{18.23}$$

Now if the prior $P_0(c)$ has the form $P_0(c) \propto c^{-a-1} e^{-b/c}$ the posterior will also be of that
form with modified values for $a$ and $b$. Such a $P_0(c)$ will thus be our conjugate prior. This
law is precisely the inverse-gamma distribution (see Fig. 18.2):

$$P_0(c) = \frac{b^a}{\Gamma(a)}\, c^{-a-1} e^{-b/c} \quad (c > 0). \tag{18.24}$$


It describes a non-negative variable, as a variance should. It is properly normalized when
$a > 0$ and has mean $b/(a - 1)$ whenever $a > 1$. If we choose such a law as our variance
prior, the posterior distribution after having observed the vector $\mathbf{y}$ is also an inverse-gamma
with parameters

$$a_p = a + \frac{T}{2} \quad \text{and} \quad b_p = b + \frac{\mathbf{y}^T\mathbf{y}}{2}. \tag{18.25}$$

The MMSE estimator can then just be read off from the mean of an inverse-gamma distribution:

$$\mathbb{E}[c]_{\mathbf{y}} = \frac{b_p}{a_p - 1} = \frac{2b + \mathbf{y}^T\mathbf{y}}{2(a - 1) + T}, \tag{18.26}$$

which can be written explicitly in the form of a linear shrinkage estimator:

$$\mathbb{E}[c]_{\mathbf{y}} = (1 - r)\, c_0 + r\, \frac{\mathbf{y}^T\mathbf{y}}{T} \quad \text{with} \quad r = \frac{T}{2(a - 1) + T}, \tag{18.27}$$

and $c_0 = b/(a - 1)$ is the mean of the prior. We see that $r \to 1$ when $T \to \infty$: in this case
the prior guess on $c_0$ disappears and one is left with the naive empirical estimator $\mathbf{y}^T\mathbf{y}/T$.
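
The conjugate update and its shrinkage form are easy to check numerically. The sketch below follows Eqs. (18.25) to (18.27); the function names and the prior parameters $a = 3$, $b = 4$ are illustrative assumptions.

```python
import numpy as np

def inverse_gamma_update(a, b, y):
    """Posterior parameters (a_p, b_p) of Eq. (18.25) for a centered Gaussian sample y."""
    y = np.asarray(y, dtype=float)
    return a + y.size / 2.0, b + (y @ y) / 2.0

def mmse_variance(a, b, y):
    """Posterior mean of c, Eq. (18.26); only defined when a_p > 1."""
    a_p, b_p = inverse_gamma_update(a, b, y)
    return b_p / (a_p - 1.0)

rng = np.random.default_rng(1)
a, b, c_true, T = 3.0, 4.0, 1.5, 50          # illustrative values
y = np.sqrt(c_true) * rng.standard_normal(T)

c0 = b / (a - 1.0)                            # prior mean
r = T / (2.0 * (a - 1.0) + T)                 # shrinkage weight of Eq. (18.27)
c_shrunk = (1.0 - r) * c0 + r * (y @ y) / T
c_mmse = mmse_variance(a, b, y)
print(c_mmse, c_shrunk)                       # the two expressions coincide
```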


Exercise 18.1.2 Conjugate prior for the amplitude of a Laplace distribution 
Suppose that we observe $T$ variables $y_i$ drawn from a Laplace distribution
(18.17) with unknown amplitude $b$. We would like to estimate $b$ using the
Bayesian method with conjugate prior.

(a) Write the joint probability density of elements of the vector $\mathbf{y}$ for a given $b$.
This is the likelihood function $P(\mathbf{y}|b)$.

(b) As a function of $b$, the likelihood function has the same form as a gamma
distribution (4.17). Using a gamma distribution with parameters $a_0$ and $b_0$
for the prior on $b$ show that the posterior distribution of $b$ is also a gamma
distribution. Find the posterior parameters $a_p$ and $b_p$.

(c) Given that the mean of a gamma distribution is given by $a/b$, write the MMSE
estimator in this case.

(d) Compute the estimator in the two limiting cases $T = 0$ and $T \to \infty$.

(e) Write your estimator from (c) as a shrinkage estimator interpolating
between these two limits. Show that the signal-to-noise ratio $r$ is given by
$r = Tm/(Tm + 2b_0)$ where $m = \sum_i |y_i|/T$. Note that in this case the
shrinkage estimator is non-linear in the naive estimate $\hat{b} = 1/(2m)$.


18.2 Estimating a Vector: Ridge and LASSO 


A very standard problem for which Bayesian ideas are helpful is linear regression. Assume
we want to estimate the parameters $a_i$ of a multi-linear regression, where we assume that
an observable $y$ can be written as

$$y = \sum_{i=1}^{N} a_i x_i + \varepsilon, \tag{18.28}$$

where $x_i$ are $N$ observable quantities and $\varepsilon$ is noise (not directly observable). We observe
a time series of $y$ of length $T$ that we stack into a vector $\mathbf{y}$, whereas the different $x_i$ are
stacked into an $N \times T$ data matrix $\mathbf{H}_{it} = x_i^t$, and $\boldsymbol{\varepsilon}$ is the corresponding $T$-dimensional
noise vector. We thus write

$$\mathbf{y} = \mathbf{H}^T \mathbf{a} + \boldsymbol{\varepsilon}, \tag{18.29}$$

where $\mathbf{a}$ is an $N$-dimensional vector of coefficients we want to estimate. We assume the
following structure for the random variables $x$ and $\varepsilon$:

$$\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] = \sigma_n^2 \mathbf{1}; \qquad \frac{1}{T}\mathbb{E}[\mathbf{H}\mathbf{H}^T] = \mathbf{C}, \tag{18.30}$$

where $\mathbf{C}$ can be an arbitrary covariance matrix, but we will assume it to be the identity $\mathbf{1}$ in
the following, unless otherwise stated.

Classical linear regression would find the coefficient vector $\mathbf{a}$ that minimizes the error
$\mathcal{E} = \|\mathbf{y} - \mathbf{H}^T\mathbf{a}\|^2$ on a given dataset. As is well known, the regression coefficients are
given by

$$\mathbf{a}_{\text{reg}} = (\mathbf{H}\mathbf{H}^T)^{-1} \mathbf{H}\mathbf{y}. \tag{18.31}$$

This equation can be derived easily by taking the derivatives of $\mathcal{E}$ with respect to all $a_i$ and
setting them to zero. Note that when $q := N/T < 1$, $\mathbf{H}\mathbf{H}^T$ is in general invertible, but
when $q > 1$ (i.e. when there is not enough data), Eq. (18.31) is a priori ill defined.

In a Bayesian estimation framework, we want to write the posterior distribution $P(\mathbf{a}|\mathbf{y})$
and build an estimator of $\mathbf{a}$ from it. We expect that the Bayesian approach will work better
than linear regression “out of sample”, i.e. on a new independent sample. The reason is
that the linear regression method minimizes an “in-sample” error, and is thus devised to fit
best the details of the observed dataset, with no regard to overfitting considerations. These
concepts will be clarified in Section 18.2.3.

Following the approach of Section 18.1, we write the posterior distribution as

$$P(\mathbf{a}|\mathbf{y}) \propto P_0(\mathbf{a}) \exp\left(-\frac{1}{2\sigma_n^2}\|\mathbf{y} - \mathbf{H}^T\mathbf{a}\|^2\right), \tag{18.32}$$

where $\sigma_n^2$ is the variance of the noise $\varepsilon$. Now, the art is to choose an adequate prior
distribution $P_0(\mathbf{a})$.


18.2.1 Ridge Regression 


The likelihood function in Eq. (18.32) is a Gaussian function of $\mathbf{a}$, so choosing a
Gaussian prior for $P_0(\mathbf{a})$ will give us a Gaussian posterior. To construct a Gaussian
distribution for $P_0(\mathbf{a})$ we need to choose a prior mean $\mathbf{a}_0$ and a prior covariance matrix.
Regression coefficients can be positive or negative, so the most natural prior mean is the
zero vector $\mathbf{a}_0 = 0$. In the absence of any other information about the direction in which
$\mathbf{a}$ may point, we should make a rotationally invariant prior for the covariance matrix.² The
only rotationally invariant choice is a multiple of the identity $\sigma_s^2 \mathbf{1}$ for the prior covariance.
Assuming that the coefficients $a_i$ are IID gives the same answer. However, we do not have a
good argument to set the scale of the covariance $\sigma_s^2$: we will come back to this point later.
The posterior distribution is then written

$$P(\mathbf{a}|\mathbf{y}) \propto \exp\left(-\frac{1}{2\sigma_n^2}\left[\mathbf{a}^T\left(\mathbf{H}\mathbf{H}^T + \frac{\sigma_n^2}{\sigma_s^2}\mathbf{1}\right)\mathbf{a} - 2\mathbf{a}^T\mathbf{H}\mathbf{y}\right]\right). \tag{18.33}$$

As announced, the posterior is a multivariate Gaussian distribution. The MMSE, MAVE and
MAP estimators are all equal to the mode of the distribution, given by³

$$\mathbb{E}[\mathbf{a}]_{\mathbf{y}} = \left(\frac{\mathbf{H}\mathbf{H}^T}{T} + \zeta\mathbf{1}\right)^{-1} \frac{\mathbf{H}\mathbf{y}}{T}, \qquad \zeta := \frac{\sigma_n^2}{T\sigma_s^2}. \tag{18.34}$$


This is called the “ridge” regression estimator, as it amounts to adding weight on the
diagonal of the sample covariance matrix $(\mathbf{H}\mathbf{H}^T)/T$. This can also be seen as a shrinkage
of the covariance matrix towards the identity, as we will discuss further in Section 18.3
below.

Another way to understand what ridge regression means is to notice that Eq. (18.31)
involves the inverse of the covariance matrix $(\mathbf{H}\mathbf{H}^T)/T$, which can be unstable in large
dimensions. This instability can lead to very large coefficients in $\mathbf{a}$. One can thus regularize
the regression problem by adding a quadratic (or $L^2$-norm) penalty for $\mathbf{a}$ so the vector does
not become too big:

$$\mathbf{a}_{\text{ridge}} = \underset{\mathbf{a}}{\operatorname{argmin}} \left[\|\mathbf{y} - \mathbf{H}^T\mathbf{a}\|^2 + T\zeta \|\mathbf{a}\|^2\right]. \tag{18.35}$$

Setting $\zeta = 0$ we recover the standard regression. The solution of the regularized optimization
problem yields exactly Eq. (18.34); it is often called the Tikhonov regularization.
Note that the resulting equation for $\mathbf{a}_{\text{ridge}}$ remains well defined even when $q > 1$ as long as
$\zeta > 0$.

In both approaches (Bayesian and Tikhonov regularization) the result depends on the
choice of the parameter $\zeta = \sigma_n^2/(T\sigma_s^2)$ which is hard to estimate a priori. The modern way
of fixing $\zeta$ in practical applications is by using a validation (or cross-validation) method.
The idea is to find the value of $\mathbf{a}_{\text{ridge}}$ on part of the data (the “training set”) and measure the
quality of the regression on another, non-overlapping part of the data (the “validation set”).
The value of $\zeta$ is then chosen as the one that gives the lowest error on the validation set.


In cross-validation, the procedure is repeated with multiple validation sets (always disjoint
from the training set) and the error is then averaged over these sets.

2 This assumption relies on our hypothesis that the covariance matrix $\mathbf{C}$ of the $x$'s is the identity matrix. Otherwise, the
eigenvectors of $\mathbf{C}$ could be used to construct non-rotationally invariant priors.

3 We have introduced a factor of $1/T$ in the definition of $\zeta$ so it parameterizes the shift in the normalized covariance matrix
$(\mathbf{H}\mathbf{H}^T)/T$. It turns out to be the proper scaling in the large $N$ limit with $q = N/T$ fixed. Note that if the elements of $\mathbf{a}$ and $\mathbf{H}$
are of order one, the variance of the elements of $\mathbf{H}^T\mathbf{a}$ is of order $N$; for the noise to contribute significantly in the large $N$ limit
we must have $\sigma_n^2$ of order $N$ and hence $\zeta$ of order 1.
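
A minimal sketch of the ridge estimator of Eq. (18.34) with a single validation set is given below. The sizes, the grid of $\zeta$ values and the helper names (`ridge`, `validation_error`, `draw`) are illustrative assumptions; the noise variance is taken of order $N$, following the scaling discussed in footnote 3.

```python
import numpy as np

def ridge(H, y, zeta):
    """Ridge estimator of Eq. (18.34); H has shape (N, T) and y has shape (T,)."""
    N, T = H.shape
    return np.linalg.solve(H @ H.T / T + zeta * np.eye(N), H @ y / T)

def validation_error(a_hat, H_val, y_val):
    """Mean squared prediction error on a held-out set."""
    return np.mean((y_val - H_val.T @ a_hat) ** 2)

rng = np.random.default_rng(2)
N, T = 50, 100
a_true = rng.standard_normal(N)
sigma_n = np.sqrt(N)                     # noise variance of order N

def draw(n_samples):
    """Draw a data matrix and the corresponding noisy observations."""
    H = rng.standard_normal((N, n_samples))
    return H, H.T @ a_true + sigma_n * rng.standard_normal(n_samples)

(H_train, y_train), (H_val, y_val) = draw(T), draw(T)
zetas = np.logspace(-3, 1, 30)           # candidate regularization strengths
errors = [validation_error(ridge(H_train, y_train, z), H_val, y_val) for z in zetas]
zeta_best = zetas[int(np.argmin(errors))]
print(zeta_best)
```

Cross-validation would repeat the last three lines over several train/validation splits and average `errors` before taking the minimum.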


18.2.2 LASSO 


Another common estimating method for vectors is the “LASSO” method⁴ which combines a
Laplace prior with the MAP estimator.

In this method, the prior distribution amounts to assuming that the coefficients of $\mathbf{a}$ are
IID Laplace random numbers with variance $2b^{-2}$. The posterior then becomes

$$P(\mathbf{a}|\mathbf{y}) \propto \exp\left(-b\sum_{i=1}^{N}|a_i| - \frac{1}{2\sigma_n^2}\|\mathbf{y} - \mathbf{H}^T\mathbf{a}\|^2\right).$$

As in the toy model Eq. (18.18), the MMSE and MAVE estimators look rather ugly, but the MAP
one is quite simple. It is given by the maximum of the argument of the above exponential:

$$\mathbf{a}_{\text{LASSO}} = \underset{\mathbf{a}}{\operatorname{argmin}}\left[2b\sigma_n^2\sum_{i=1}^{N}|a_i| + \|\mathbf{y} - \mathbf{H}^T\mathbf{a}\|^2\right]. \tag{18.36}$$

This minimization amounts to regularizing the standard regression estimation with an absolute
value penalty (also called $L^1$-norm penalty), instead of the quadratic penalty for the
ridge regression. Interestingly, the solution to this minimization problem leads to a sparse
estimator: the absolute value penalty strongly disfavors small values of $|a_i|$ and prefers
to set these values to zero. Only sufficiently relevant coefficients $a_i$ are retained: LASSO
automatically selects the salient factors (this is the “so” part in LASSO), which is very useful
for interpreting the regression results intuitively.

Note that the true vector $\mathbf{a}$ is not sparse, as the probability to find a coefficient $a_i$ to
be exactly zero is itself zero for the prior Laplace distribution, which does not contain a
singular $\delta(a)$ peak. The sparsity of the LASSO estimator $\mathbf{a}_{\text{LASSO}}$ is controlled by the parameter
$b$. When $b\sigma_n^2 \to 0$, the penalty disappears and all the coefficients of the vector $\mathbf{a}$ are
non-zero (barring exceptional cases). When $b\sigma_n^2 \to \infty$, on the other hand, all coefficients
are zero. In fact, the number of non-zero coefficients is a monotonic decreasing function of
$b\sigma_n^2$. As for the parameter $\zeta$ for the ridge regression, it is hard to come up with a good prior
value for $b$, which should be estimated again using validation or cross-validation methods
(Figure 18.3). Finally we note that it is sometimes useful to combine the $L^1$ penalty of
LASSO with the $L^2$ penalty of ridge; the resulting estimator is called an elastic net.
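
The text does not specify how the minimization (18.36) is carried out. One simple possibility, used here purely for illustration, is the proximal-gradient (ISTA) iteration, which alternates a gradient step on the quadratic term with the soft-thresholding of Eq. (18.19). The function names, problem sizes and iteration count below are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Component-wise soft-thresholding at level t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(H, y, penalty, n_iter=2000):
    """Minimize penalty * sum|a_i| + |y - H^T a|^2 by proximal gradient (ISTA).

    `penalty` plays the role of 2 * b * sigma_n^2 in Eq. (18.36).
    """
    N, _ = H.shape
    # Step size = 1 / Lipschitz constant of the gradient of the quadratic term
    step = 1.0 / (2.0 * np.linalg.eigvalsh(H @ H.T)[-1])
    a = np.zeros(N)
    for _ in range(n_iter):
        grad = 2.0 * H @ (H.T @ a - y)
        a = soft_threshold(a - step * grad, step * penalty)
    return a

rng = np.random.default_rng(5)
N, T, b, sigma_n = 20, 200, 1.0, 3.0
a_true = rng.laplace(scale=1.0 / b, size=N)          # Laplace prior with variance 2 / b^2
H = rng.standard_normal((N, T))
y = H.T @ a_true + sigma_n * rng.standard_normal(T)

a_lasso = lasso_ista(H, y, penalty=2.0 * b * sigma_n**2)
n_zero = int(np.sum(a_lasso == 0.0))                 # coefficients set exactly to zero
print(n_zero, np.round(a_lasso, 2))
```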


18.2.3 In-Sample and Out-of-Sample Error

Standard linear regression is built to minimize the sum of the squared-residuals on the
dataset at hand. We call this error the in-sample error. In many cases, we are interested in
the predictive power of a linear model and the relevant error is the mean square error on a
new independent but statistically equivalent dataset: the out-of-sample error. If the number
of fitted variables (degrees of freedom) is small with respect to the number of samples,
the in-sample error is a good estimator of the out-of-sample error and the standard linear
regression is also the optimal linear predictive model. The situation changes radically when
the number of fitted variables becomes comparable to the number of samples, as we discuss
below.

4 LASSO stands for Least Absolute Shrinkage and Selection Operator.

Figure 18.3 Illustration of the validation method in LASSO regularization. We built a linear model
with 500 coefficients drawn from a Laplace distribution with $b = 1$ and Gaussian noise $\sigma_n^2 = 1000$.
The model is estimated using $T = 1000$ unit Gaussian data and validated using a different set of
the same size. The error on the training set is minimal with no regularization and gets worse as $b$
increases (dashed line, left axis). The validation error (full line, left axis) is minimal for $b$ about equal
to 1. The dotted line (right axis) shows the fraction of the coefficients estimated to be exactly zero;
this number grows from zero (no regularization) to almost 1 (strong regularization).
To summarize, linear regression, regularized or not, addresses two types of task:

• In-sample estimator: we observe some $\mathbf{H}_1$ and $\mathbf{y}_1$, and estimate $\mathbf{a}$.
• Out-of-sample prediction: we observe some other $\mathbf{H}_2$, non-overlapping with $\mathbf{H}_1$, and
use them to predict $\mathbf{y}_2$ with the in-sample estimate of $\mathbf{a}$.

The result of the in-sample estimation is given by Eq. (18.31), which we write as

$$\mathbf{a}_{\text{reg}} = \mathbf{E}^{-1}\mathbf{b}; \qquad \mathbf{E} := \frac{1}{T}\mathbf{H}_1\mathbf{H}_1^T, \qquad \mathbf{b} := \frac{1}{T}\mathbf{H}_1\mathbf{y}_1. \tag{18.37}$$

This is the best in-sample estimator. However, this is not necessarily the case for the
out-of-sample prediction.

Note that both the standard regression and the ridge regression estimator (Eq. (18.34))
are of the form $\hat{\mathbf{a}} = \mathbf{\Xi}^{-1}\mathbf{b}$ with $\mathbf{\Xi} = \mathbf{E}$ and $\mathbf{\Xi} = \mathbf{E} + \zeta\mathbf{1}$, respectively. We will compute
in the following the in-sample and out-of-sample estimation error for any estimator of
that nature.

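Recalling that $\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T] = \sigma_n^2\mathbf{1}$ and after some calculations, one finds that the in-sample
($\mathcal{R}^2_{\text{in}}$) error is given by

$$\begin{aligned} \mathcal{R}^2_{\text{in}}(\hat{\mathbf{a}}) = {} & \frac{\sigma_n^2}{T}\left[T - 2\operatorname{Tr}(\mathbf{\Xi}^{-1}\mathbf{E}) + \operatorname{Tr}(\mathbf{\Xi}^{-1}\mathbf{E}\mathbf{\Xi}^{-1}\mathbf{E})\right] \\ & + \mathbf{a}^T\left(\mathbf{E} - 2\mathbf{E}\mathbf{\Xi}^{-1}\mathbf{E} + \mathbf{E}\mathbf{\Xi}^{-1}\mathbf{E}\mathbf{\Xi}^{-1}\mathbf{E}\right)\mathbf{a}. \end{aligned} \tag{18.38}$$

In the special case $\mathbf{\Xi} = \mathbf{E}$, we have

$$\mathcal{R}^2_{\text{in}}(\mathbf{a}_{\text{reg}}) = \frac{\sigma_n^2}{T}(T - N) = (1 - q)\sigma_n^2,$$

which is smaller than the true error, which is simply equal to $\sigma_n^2$. In fact, the error goes
to zero as $q \to 1$, i.e. when the number of parameters becomes equal to the number of
observations. This error reduction is called “overfitting”, or in-sample bias: if the task is
to find the best model that explains past data, one can do better than the true error. Note
that the above result is quite special, in the sense that it actually does not depend on either
$\mathbf{E}$ or $\mathbf{a}$.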

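Next we calculate the expected out-of-sample ($\mathcal{R}^2_{\text{out}}$) error. We draw another matrix $\mathbf{H}_2$
of size $N \times T_2$ and consider another independent noise vector $\boldsymbol{\varepsilon}_2$ of variance $\sigma_n^2$ and size
$T_2$ (where $T_2$ does not need to be equal to $T$, it can even be equal to 1). We calculate

$$\begin{aligned} \mathcal{R}^2_{\text{out}}(\hat{\mathbf{a}}) &= \frac{1}{T_2}\mathbb{E}_{\mathbf{H}_2,\boldsymbol{\varepsilon}}\left[\left\|\mathbf{H}_2^T\mathbf{a} + \boldsymbol{\varepsilon}_2 - \mathbf{H}_2^T\hat{\mathbf{a}}\right\|^2\right] \\ &= \frac{1}{T_2}\mathbb{E}_{\mathbf{H}_2,\boldsymbol{\varepsilon}}\left[\left\|\mathbf{H}_2^T\mathbf{a} + \boldsymbol{\varepsilon}_2 - \mathbf{H}_2^T\mathbf{\Xi}^{-1}\mathbf{E}_1\mathbf{a} - \frac{1}{T}\mathbf{H}_2^T\mathbf{\Xi}^{-1}\mathbf{H}_1\boldsymbol{\varepsilon}_1\right\|^2\right], \end{aligned} \tag{18.39}$$

where we denote $\mathbf{E}_1 := T^{-1}\mathbf{H}_1\mathbf{H}_1^T$. We now assume that $T_2^{-1}\mathbb{E}[\mathbf{H}_2\mathbf{H}_2^T] = \mathbf{C}$ with a general
covariance $\mathbf{C}$.

In the standard regression case, $\mathbf{\Xi} = \mathbf{E}$ and $\hat{\mathbf{a}} = \mathbf{a}_{\text{reg}}$ and we have

$$\mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{reg}}) = \frac{1}{T_2}\mathbb{E}\left[\left\|\boldsymbol{\varepsilon}_2 - \frac{1}{T}\mathbf{H}_2^T\mathbf{E}^{-1}\mathbf{H}_1\boldsymbol{\varepsilon}_1\right\|^2\right] = \sigma_n^2 + \frac{\sigma_n^2}{T}\operatorname{Tr}(\mathbf{E}^{-1}\mathbf{C}). \tag{18.40}$$

Now since $\mathbf{E}$ is a sample covariance matrix with true covariance $\mathbf{C}$, we have

$$\operatorname{Tr}(\mathbf{E}^{-1}\mathbf{C}) = \operatorname{Tr}(\mathbf{C}^{-1/2}\mathbf{W}_q^{-1}\mathbf{C}^{-1/2}\mathbf{C}) = \operatorname{Tr}(\mathbf{W}_q^{-1}) = \frac{N}{1 - q}, \tag{18.41}$$

where $\mathbf{W}_q$ denotes a standard Wishart matrix. Thus we find

$$\mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{reg}}) = \sigma_n^2 + \frac{q\sigma_n^2}{1 - q} = \frac{\sigma_n^2}{1 - q} = \frac{\mathcal{R}^2_{\text{in}}(\mathbf{a}_{\text{reg}})}{(1 - q)^2}. \tag{18.42}$$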
As an illustration see Figure 18.3 where without regularization ($b = 0$) we have indeed
$\mathcal{R}^2_{\text{out}}/\mathcal{R}^2_{\text{in}} \approx (1 - q)^{-2} = 4$. Thus, we see that we can make precise statements about the
following intuitive inequalities:

in-sample error ≤ true error ≤ out-of-sample error. (18.43)

Note that the out-of-sample error tends to $\infty$ as $N \to T$.
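
These two results can be checked with a small Monte Carlo experiment. The sketch below uses unit Gaussian data ($\mathbf{C} = \mathbf{1}$), $q = 1/2$ and 200 independent trials; all sizes and variable names are illustrative assumptions. The averages should come out close to $(1 - q)\sigma_n^2$ and $\sigma_n^2/(1 - q)$ respectively, up to finite-size corrections.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, T2, sigma_n = 100, 200, 200, 1.0
q = N / T
r_in, r_out = [], []

for _ in range(200):
    a = rng.standard_normal(N)                        # true coefficients
    H1 = rng.standard_normal((N, T))                  # in-sample data
    H2 = rng.standard_normal((N, T2))                 # out-of-sample data
    y1 = H1.T @ a + sigma_n * rng.standard_normal(T)
    y2 = H2.T @ a + sigma_n * rng.standard_normal(T2)
    a_reg = np.linalg.solve(H1 @ H1.T, H1 @ y1)       # Eq. (18.31)
    r_in.append(np.mean((y1 - H1.T @ a_reg) ** 2))    # in-sample error
    r_out.append(np.mean((y2 - H2.T @ a_reg) ** 2))   # out-of-sample error

print(np.mean(r_in), (1 - q) * sigma_n**2)
print(np.mean(r_out), sigma_n**2 / (1 - q))
```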
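Now, let us compute the expected out-of-sample error ($\mathcal{R}^2_{\text{out}}$) for the ridge predictor
$\mathbf{a}_{\text{ridge}}$, parameterized by $\zeta$. The result reads

$$\mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{ridge}}) = \sigma_n^2 + \frac{\sigma_n^2}{T}\operatorname{Tr}(\mathbf{C}\mathbf{\Xi}^{-1}\mathbf{E}\mathbf{\Xi}^{-1}) + \zeta^2\operatorname{Tr}(\mathbf{C}\mathbf{\Xi}^{-1}\mathbf{a}\mathbf{a}^T\mathbf{\Xi}^{-1}) \tag{18.44}$$

with $\mathbf{\Xi} = \mathbf{E} + \zeta\mathbf{1}$. Expanding to linear order for small $\zeta$ then leads to

$$\begin{aligned} \mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{ridge}}) &= \mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{reg}}) - \frac{2\sigma_n^2}{T}\operatorname{Tr}(\mathbf{C}\mathbf{E}^{-2})\,\zeta + O(\zeta^2) \\ &= \mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{reg}}) - \frac{2\sigma_n^2 q\,\tau(\mathbf{C}^{-1})}{(1 - q)^3}\,\zeta + O(\zeta^2), \end{aligned} \tag{18.45}$$

where $\tau(.) = \operatorname{Tr}(.)/N$ and we have used the fact that $\tau(\mathbf{W}_q^{-2}) = (1 - q)^{-3}$. The important
point here is that the coefficient in front of $\zeta$ is negative, i.e. to first order, the ridge estimator
has a lower out-of-sample error than the naive regression estimator:

ridge estimation error < naive estimation error. (18.46)

However, the ridge estimator introduces a systematic bias since $\|\mathbf{a}_{\text{ridge}}\|^2 < \|\mathbf{a}_{\text{reg}}\|^2$ when
$\zeta > 0$. This gives the third term in Eq. (18.44), which becomes large for larger $\zeta$. So one
indeed expects that there should exist an optimal value of $\zeta$ (which depends on the specific
problem at hand) which minimizes the out-of-sample error. We now show how this optimal
out-of-sample error can be elegantly computed in the large $N$ limit.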


The Large N Limit 


In the large $N$ limit we can recover the fact that the ridge estimator of Section 18.2.1
minimizes the out-of-sample risk without the need of a Gaussian prior on $\mathbf{a}$. We will also
find an interesting relation between the Wishart Stieltjes transform and the out-of-sample
risk of the ridge estimator.

In the following we will assume that the elements of the out-of-sample data $\mathbf{H}_2$ are IID
with unit variance, i.e. that $\mathbf{C} = \mathbf{1}$. Then when $\mathbf{\Xi} = \mathbf{E}_1 + \zeta\mathbf{1}$, Eq. (18.44) becomes, in the
large $N$ limit,

$$\mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{ridge}}) = \sigma_n^2\left(1 - q\,g_{\mathbf{W}_q}(-\zeta)\right) + \zeta\left(q\sigma_n^2 - \zeta|\mathbf{a}|^2\right)g'_{\mathbf{W}_q}(-\zeta), \tag{18.47}$$

where we have used that $\mathbf{E}_1$ is a Wishart matrix free from $\mathbf{a}\mathbf{a}^T$, and that

$$\tau(\mathbf{\Xi}^{-1}) = -g_{\mathbf{W}_q}(-\zeta); \qquad \tau(\mathbf{\Xi}^{-2}) = -g'_{\mathbf{W}_q}(-\zeta). \tag{18.48}$$

For $\zeta = 0$, we have $g_{\mathbf{W}_q}(0) = -1/(1 - q)$ and we thus recover Eq. (18.42). In the
large $N$ limit, the out-of-sample error for an estimator with $\mathbf{\Xi} = \mathbf{E}_1 + \zeta\mathbf{1}$ depends on the
vector $\mathbf{a}$ only through its norm $|\mathbf{a}|^2$, regardless of the distribution of its components. The
optimal value of $\zeta$ must then also only depend on $|\mathbf{a}|^2$.

Now, we know that when $\mathbf{a}$ is drawn from a Gaussian distribution, the value
$\zeta_{\text{opt}} = \sigma_n^2/(T\sigma_s^2)$ is optimal. In the large $N$ limit, $|\mathbf{a}|^2$ is self-averaging and equal to $N\sigma_s^2$. So
$\zeta_{\text{opt}} = q\sigma_n^2/|\mathbf{a}|^2$. We can check directly that this value is optimal by computing the
derivative of Eq. (18.47) with respect to $\zeta$ evaluated at $\zeta_{\text{opt}}$. Indeed we have

$$q\sigma_n^2\, g'_{\mathbf{W}_q}(-\zeta_{\text{opt}}) - \zeta_{\text{opt}}|\mathbf{a}|^2\, g'_{\mathbf{W}_q}(-\zeta_{\text{opt}}) = 0. \tag{18.49}$$

For the optimal value of $\zeta$ we also have

$$\mathcal{R}^2_{\text{out}}(\mathbf{a}_{\text{ridge}}) = \sigma_n^2\left(1 - q\,g_{\mathbf{W}_q}(-\zeta_{\text{opt}})\right), \tag{18.50}$$

where $g_{\mathbf{W}_q}(z)$ is given by Eq. (4.40). Since $-g_{\mathbf{W}_q}(-z)$ is positive and monotonically
decreasing for $z > 0$, we recover that the optimal ridge out-of-sample error is smaller than
that of the standard regression.


18.3 Bayesian Estimation of the True Covariance Matrix 


We now apply the Bayesian estimation method to covariance matrices. From empirical data,
we measure the sample covariance matrix $\mathbf{E}$, and want to infer the most reliable information
about the “true” underlying covariance matrix $\mathbf{C}$. Hence we write Bayes’ equation for
conditional probabilities for matrices:

$$P(\mathbf{C}|\mathbf{E}) \propto P(\mathbf{E}|\mathbf{C})\,P_0(\mathbf{C}). \tag{18.51}$$

We now recall Eq. (4.16) established in Chapter 4 for Gaussian observations:

$$P(\mathbf{E}|\mathbf{C}) \propto (\det\mathbf{C})^{-T/2}\exp\left[-\frac{T}{2}\operatorname{Tr}\left(\mathbf{C}^{-1}\mathbf{E}\right)\right]. \tag{18.52}$$

As explained in Section 18.1.3, in the absence of any meaningful prior information, it is
interesting to pick a conjugate prior, which here is of the form

$$P_0(\mathbf{C}) \propto (\det\mathbf{C})^{a}\exp\left[-b\operatorname{Tr}\left(\mathbf{C}^{-1}\mathbf{X}\right)\right] \tag{18.53}$$


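for some matrix $\mathbf{X}$, which turns out to be proportional to the prior mean of $\mathbf{C}$. Indeed,
this prior is in fact the probability density of the elements of an inverse-Wishart matrix.
Consider an inverse-Wishart matrix $\mathbf{C}$ of size $N$, with $T^*$ degrees of freedom and centered at a
(positive definite) matrix $\mathbf{X}$. If $T^* > N + 1$, $\mathbf{C}$ has the density (see Eq. (15.35))

$$P(\mathbf{C}) \propto (\det\mathbf{C})^{-(T^* + N + 1)/2}\exp\left[-\frac{T^* - N - 1}{2}\operatorname{Tr}\left(\mathbf{C}^{-1}\mathbf{X}\right)\right]. \tag{18.54}$$

Note that here $T^* > N$ is some parameter that is unrelated to the length of the time series
$T$. The chosen normalization is such that $\mathbb{E}_0[\mathbf{C}] = \mathbf{X}$. As $T^* \to \infty$, we have $\mathbf{C} \to \mathbf{X}$.
With this prior we thus obtain

$$P(\mathbf{C}|\mathbf{E}) \propto (\det\mathbf{C})^{-(T + T^* + N + 1)/2}\exp\left[-\frac{T}{2}\operatorname{Tr}\left(\mathbf{C}^{-1}\mathbf{E}^*\right)\right], \tag{18.55}$$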
where we define

$$\mathbf{E}^* := \mathbf{E} + \frac{T^* - N - 1}{T}\mathbf{X}. \tag{18.56}$$

We now notice that (18.55) is, by construction, also a probability density for an inverse-Wishart,
now with $T + T^*$ degrees of freedom, with

$$\mathbb{E}[\mathbf{C}|\mathbf{E}] = \frac{T\mathbf{E}^*}{T + T^* - N - 1} = r\mathbf{E} + (1 - r)\mathbf{X}, \tag{18.57}$$

with

$$r = \frac{T}{T + T^* - N - 1}. \tag{18.58}$$

Hence we recover a linear shrinkage, similar to Eq. (18.10) in the case of a scalar variable
with a Gaussian prior. We will recover this shrinkage formula in the context of rotationally
invariant estimators in the next chapter, see Eq. (19.49).
We end with the following remarks:

• The linear shrinkage works even for the finite $N$ case, i.e. without the large $N$ hypothesis.
• In general, if one has no idea of what $\mathbf{X}$ should be, one can use the identity matrix, i.e.

$$\mathbb{E}[\mathbf{C}|\mathbf{E}] = r\mathbf{E} + (1 - r)\mathbf{1}. \tag{18.59}$$

Another simple choice is a covariance matrix $\mathbf{X}$ corresponding to a one-factor model (see
Section 20.4.2):

$$\mathbf{X}_{ij} = \sigma_s^2\left[\delta_{ij} + \rho(1 - \delta_{ij})\right], \tag{18.60}$$

where $\rho$ is the average pairwise correlation (which can also be learned using validation).
• Note that $T^*$ (or equivalently $r$) is generally unknown. It may be inferred from the data
or learned using validation.
• As we will see in Chapter 20, the linear shrinkage works quite well in financial applications,
showing that inverse-Wishart is not a bad prior for the true covariance matrix in
that case (see Fig. 15.1).
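
As a concrete illustration of Eqs. (18.57) to (18.59), the sketch below shrinks a singular sample covariance matrix ($T < N$) towards the identity. The function name `linear_shrinkage_cov` and the values of $N$, $T$ and $T^*$ are hypothetical; in practice $T^*$ would be inferred or validated as discussed in the remarks above.

```python
import numpy as np

def linear_shrinkage_cov(E, r, X=None):
    """Return r * E + (1 - r) * X, Eq. (18.57); X defaults to the identity (Eq. (18.59))."""
    if X is None:
        X = np.eye(E.shape[0])
    return r * E + (1.0 - r) * X

rng = np.random.default_rng(4)
N, T, T_star = 50, 40, 120            # T < N: the sample covariance matrix is singular
H = rng.standard_normal((N, T))
E = H @ H.T / T                       # sample covariance matrix
r = T / (T + T_star - N - 1)          # Eq. (18.58)
C_hat = linear_shrinkage_cov(E, r)

# Smallest eigenvalues before and after shrinkage
print(np.linalg.eigvalsh(E)[0], np.linalg.eigvalsh(C_hat)[0])
```

Even though `E` has zero eigenvalues, the shrunk matrix has all its eigenvalues bounded below by $1 - r$, so it can be inverted safely.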
Eigenvector Overlaps and Rotationally Invariant Estimators


19.1 Eigenvector Overlaps 
19.1.1 Setting the Stage 


We saw in the first two parts of this book how tools from RMT allow one to infer many
properties of the eigenvalue distribution, encoded in the trace of the resolvent of the random
matrix under scrutiny. As in the previous chapters, random matrices of particular interest
are of the form

$$\mathbf{E} = \mathbf{C} + \mathbf{X}, \quad \text{or} \quad \mathbf{E} = \mathbf{C}^{1/2}\mathbf{W}\mathbf{C}^{1/2}, \tag{19.1}$$

where $\mathbf{X}$ and $\mathbf{W}$ represent some “noise”, for example $\mathbf{X}$ might be a Wigner matrix in the
additive case and $\mathbf{W}$ a white Wishart matrix in the multiplicative case, whereas $\mathbf{C}$ is the
“true”, uncorrupted matrix that one would like to measure. One often calls $\mathbf{E}$ the sample
matrix and $\mathbf{C}$ the population matrix.

In this section we want to discuss the properties of the eigenvectors of $\mathbf{E}$, and in particular
their relation with the eigenvectors of $\mathbf{C}$. There are, at least, two natural questions about
the eigenvectors of the sample matrix $\mathbf{E}$:

1 How similar are sample eigenvectors $[\mathbf{v}_i]_{i\in(1,N)}$ of $\mathbf{E}$ and the true ones $[\mathbf{u}_j]_{j\in(1,N)}$ of $\mathbf{C}$?

2 What information can we learn by observing two independent realizations, say
$\mathbf{E} = \mathbf{C}^{1/2}\mathbf{W}\mathbf{C}^{1/2}$ and $\mathbf{E}' = \mathbf{C}^{1/2}\mathbf{W}'\mathbf{C}^{1/2}$ in the multiplicative case, that remain correlated
through $\mathbf{C}$?

A natural quantity to characterize the similarity between two arbitrary vectors, say $\boldsymbol{\chi}$
and $\boldsymbol{\zeta}$, is the scalar product of $\boldsymbol{\chi}$ and $\boldsymbol{\zeta}$. More formally, we define the “overlap” as $\boldsymbol{\chi}^T\boldsymbol{\zeta}$.
Since the eigenvectors of real symmetric matrices are only defined up to a sign, it is in
fact more natural to consider the squared overlaps $(\boldsymbol{\chi}^T\boldsymbol{\zeta})^2$. In the first problem alluded to
above, we want to understand the relation between the eigenvectors of the sample matrix
$[\mathbf{v}_i]_{i\in(1,N)}$ and those of the population matrix $[\mathbf{u}_j]_{j\in(1,N)}$. The matrix of squared overlaps
is defined as $(\mathbf{v}_i^T\mathbf{u}_j)^2$, which actually forms a so-called bi-stochastic matrix (positive
elements with the sums over both rows and columns all equal to unity).
In order to study these overlaps, the central tool is again the resolvent matrix (and not
its normalized trace as for the Stieltjes transform), which we recall is defined as

$$\mathbf{G}_{\mathbf{A}}(z) = (z\mathbf{1} - \mathbf{A})^{-1}, \tag{19.2}$$

for any arbitrary symmetric matrix $\mathbf{A}$. Now, if we expand $\mathbf{G}_{\mathbf{E}}$ over the eigenvectors $\mathbf{v}_i$ of $\mathbf{E}$,
we obtain that

$$\mathbf{u}^T\mathbf{G}_{\mathbf{E}}(z)\mathbf{u} = \sum_{i=1}^{N}\frac{(\mathbf{u}^T\mathbf{v}_i)^2}{z - \lambda_i} \tag{19.3}$$

for any $\mathbf{u}$ in $\mathbb{R}^N$.

We thus see from Eq. (19.3) that each pole of the resolvent defines a projection onto
the corresponding sample eigenvectors. This suggests that the techniques we need to apply
are very similar to the ones used above to study the density of states. However, one should
immediately stress that contrary to eigenvalues, each eigenvector $\mathbf{v}_i$ for any given $i$ continues
to fluctuate when $N \to \infty$ and never reaches a deterministic limit. As a consequence,
we will need to introduce some averaging procedure to obtain a well-defined result. We
will thus consider the following quantity:

$$\Phi(\lambda_i, \mu_j) := N\,\mathbb{E}[(\mathbf{v}_i^T\mathbf{u}_j)^2], \tag{19.4}$$

where the expectation $\mathbb{E}$ can be interpreted either as an average over different realizations
of the randomness or, perhaps more meaningfully for applications, as an average for a
fixed sample over small intervals of sample eigenvalues, of width $d\lambda = \eta$. We choose $\eta$ in
the range $1 \gg \eta \gg N^{-1}$ (say $\eta = N^{-1/2}$) such that there are many eigenvalues in the
interval $d\lambda$, while keeping $d\lambda$ sufficiently small for the spectral density to be approximately
constant. Interestingly, the two procedures lead to the same result for large matrices, i.e.
the locally smoothed quantity $\Phi(\lambda, \mu)$ is “self-averaging”. A way to do this smoothing
automatically is, as we explained in Chapter 2, to choose $z = \lambda_i - i\eta$ in Eq. (19.3),
leading to

$$\operatorname{Im}\,\mathbf{u}_j^T\mathbf{G}_{\mathbf{E}}(\lambda_i - i\eta)\mathbf{u}_j \approx \pi\rho_{\mathbf{E}}(\lambda_i)\,\Phi(\lambda_i, \mu_j), \tag{19.5}$$

provided $\eta$ is in the range $1 \gg \eta \gg N^{-1}$. Note that we have replaced $N(\mathbf{u}_j^T\mathbf{v}_i)^2$ with
$\Phi(\lambda_i, \mu_j)$, to emphasize the fact that we expect typical square overlaps to be of order $1/N$,
such that $\Phi$ is of order unity when $N \to \infty$. This assumption will indeed be shown to hold
below. In fact, when $\mathbf{u}_j$ and $\mathbf{v}_i$ are completely uncorrelated, one finds $\Phi(\lambda_i, \mu_j) = 1$.


For the second question, the main quantity of interest is, similarly, the (mean squared)
overlap between the eigenvectors of two independent noisy matrices $\mathbf{E}$ and $\mathbf{E}'$:

$$\Psi(\lambda_i, \lambda'_j) := N\,\mathbb{E}[(\mathbf{v}_i^T\mathbf{v}'_j)^2], \tag{19.6}$$

where $[\lambda'_j]_{j\in(1,N)}$ and $[\mathbf{v}'_j]_{j\in(1,N)}$ are the eigenvalues and eigenvectors of $\mathbf{E}'$, i.e. another
sample matrix that is independent from $\mathbf{E}$ but with the same underlying population
matrix $\mathbf{C}$. In order to get access to $\Psi(\lambda_i, \lambda'_j)$ defined in Eq. (19.6), one should consider
the following quantity:

$$\psi(z, z') = \frac{1}{N}\operatorname{Tr}\left[\mathbf{G}_{\mathbf{E}}(z)\mathbf{G}_{\mathbf{E}'}(z')\right]. \tag{19.7}$$

After simple manipulations one readily obtains a generalized Sokhotski–Plemelj formula,
where $\eta$ is such that $1 \gg \eta \gg N^{-1}$:

$$\operatorname{Re}\left[\psi(\lambda_i - i\eta, \lambda'_j + i\eta) - \psi(\lambda_i - i\eta, \lambda'_j - i\eta)\right] \approx 2\pi^2\rho_{\mathbf{E}}(\lambda_i)\,\rho_{\mathbf{E}'}(\lambda'_j)\,\Psi(\lambda_i, \lambda'_j). \tag{19.8}$$

This representation allows one to obtain interesting results for the overlaps between the
eigenvectors of two independently drawn random matrices, see Eq. (19.14).


19.1.2 Overlaps in the Additive Case 


Now, we can use the subordination relation for the resolvent of the sum of two free matrices
established in Chapter 13, Eq. (13.44), which in the present case reads

$$\mathbb{E}[\mathbf{G}_{\mathbf{E}}(z)] = \mathbf{G}_{\mathbf{C}}\left(z - R_{\mathbf{X}}(g_{\mathbf{E}}(z))\right). \tag{19.9}$$

Since we choose $\mathbf{u}_j$ to be an eigenvector of $\mathbf{C}$ with eigenvalue $\mu_j$, one finds

$$\mathbf{u}_j^T\mathbf{G}_{\mathbf{E}}(\lambda_i - i\eta)\mathbf{u}_j = \frac{1}{\lambda_i - i\eta - R_{\mathbf{X}}(g_{\mathbf{E}}(\lambda_i - i\eta)) - \mu_j}, \tag{19.10}$$

where we have dropped the expectation value as the left hand side is self-averaging when $\eta$
is in the correct range. The imaginary part of this quantity, calculated for $\eta \to 0$, gives
access to $\Phi(\lambda_i, \mu_j)$. The formula simplifies in the common case where the noise matrix $\mathbf{X}$
is a Wigner matrix, such that $R_{\mathbf{X}}(z) = \sigma^2 z$. In this case, one finally obtains a Lorentzian
shape for the squared overlaps:

$$\Phi(\lambda, \mu) = \frac{\sigma^2}{\left(\mu - \lambda + \sigma^2 h_{\mathbf{E}}(\lambda)\right)^2 + \sigma^4\pi^2\rho_{\mathbf{E}}(\lambda)^2}, \tag{19.11}$$

where we have decomposed the Stieltjes transform into its real and imaginary parts as
$g_{\mathbf{E}}(x) = h_{\mathbf{E}}(x) + i\pi\rho_{\mathbf{E}}(x)$; note that $h_{\mathbf{E}}(x)$ is equal to $\pi$ times the Hilbert transform of
$\rho_{\mathbf{E}}(x)$.

In Figure 19.1, we illustrate this formula in the case where $\mathbf{C}$ is a Wigner matrix with
parameter $\sigma^2 = 1$. For a fixed $\lambda$, the overlap peaks for

$$\mu = \lambda - \sigma^2 h_{\mathbf{E}}(\lambda), \tag{19.12}$$

with a width $\sim \sigma^2\rho_{\mathbf{E}}(\lambda)$. When $\sigma \to 0$, i.e. in the absence of noise, one recovers

$$\Phi(\lambda, \mu) \to \delta(\lambda - \mu), \tag{19.13}$$

as expected since in this case the eigenvectors of $\mathbf{E}$ are trivially the same as those of $\mathbf{C}$.
Note that apart from the singular case $\sigma = 0$, $\Phi(\lambda, \mu)$ is found to be of order unity when
$N \to \infty$. But because of the factor $N$ in Eq. (19.4), the overlaps between $\mathbf{v}_i$ and $\mathbf{u}_j$ are of
order $N^{-1/2}$ as soon as $\sigma > 0$.
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Figure 19.1 Normalized squared-overlap function ®(A, u) for E = Xj + X2, the sum of two unit 
Wigner matrices, compared with a numerical simulation for y = —1.5 and u = 0.5. The simulations 
are for a single sample of size N = 2000, each data point corresponds to N times the square overlap 
between an eigenvector of E and one of Xj averaged over eigenvectors with eigenvalues within 
distance n = 2/J/N of u. 
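The comparison described in the caption of Figure 19.1 can be sketched numerically. The code below is a minimal illustration, assuming that both $\mathbf{C}$ and $\mathbf{X}$ are unit Wigner matrices, so that $\mathbf{E}$ is Wigner with variance 2 and $\mathfrak{h}_E$, $\rho_E$ are known in closed form. The function names (`wigner`, `phi_additive`, `empirical_phi`) and the size $N = 400$ are illustrative choices, not taken from the text.

```python
import numpy as np

def wigner(n, sigma, rng):
    """Symmetric Gaussian matrix; its spectrum tends to a semicircle of variance sigma**2."""
    a = rng.standard_normal((n, n))
    return sigma * (a + a.T) / np.sqrt(2 * n)

def phi_additive(lam, mu, sigma_x2, sigma_e2):
    """Eq. (19.11), specialised to the case where E is itself Wigner with variance sigma_e2."""
    h = lam / (2 * sigma_e2)                                   # real part of g_E
    pi_rho = np.sqrt(np.maximum(4 * sigma_e2 - lam**2, 0.0)) / (2 * sigma_e2)  # pi * rho_E
    return sigma_x2 / ((mu - lam + sigma_x2 * h) ** 2 + (sigma_x2 * pi_rho) ** 2)

def empirical_phi(n=400, mu0=0.5, seed=0):
    """N times the squared overlaps, averaged over eigenvectors of C with eigenvalue near mu0."""
    rng = np.random.default_rng(seed)
    c, x = wigner(n, 1.0, rng), wigner(n, 1.0, rng)
    mu, u = np.linalg.eigh(c)
    lam, v = np.linalg.eigh(c + x)
    near = np.abs(mu - mu0) < 2 / np.sqrt(n)                   # window of width eta = 2/sqrt(N)
    return lam, (n * (u[:, near].T @ v) ** 2).mean(axis=0)

lam, phi_emp = empirical_phi()
phi_th = phi_additive(lam, 0.5, sigma_x2=1.0, sigma_e2=2.0)   # theory on the same eigenvalues
```

Plotting `phi_emp` and `phi_th` against `lam` should reproduce the humped shape of the figure, up to the noise expected for a single sample of moderate size.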


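Now suppose that $\mathbf{E} = \mathbf{C} + \mathbf{X}$ and $\mathbf{E}' = \mathbf{C} + \mathbf{X}'$, where $\mathbf{X}$ and $\mathbf{X}'$ are two independent
Wigner matrices with the same variance $\sigma^2$. Using Eq. (19.8), one can compute the
expected overlap between the eigenvectors of $\mathbf{E}$ and $\mathbf{E}'$. After a little work, one can
establish the following result for $\Psi(\lambda, \lambda)$, i.e. the typical overlap around the same eigenvalues
for $\mathbf{E}$ and $\mathbf{E}'$:

$$\Psi(\lambda, \lambda) = \frac{\sigma^2}{2 f_2(\lambda)^2}\,\frac{\partial_\lambda f_1(\lambda)}{\left(\partial_\lambda f_1(\lambda)\right)^2 + \left(\partial_\lambda f_2(\lambda)\right)^2}, \tag{19.14}$$

where

$$f_1(\lambda) = \lambda - \sigma^2 \mathfrak{h}_E(\lambda); \qquad f_2(\lambda) = \sigma^2 \pi \rho_E(\lambda); \qquad \mathfrak{h}_E(\lambda) := \mathrm{Re}[g_E(\lambda)]. \tag{19.15}$$

Note that in the large $N$ limit, $g_E(z) = g_{E'}(z)$.

The formula for $\Psi(\lambda, \lambda')$ is more cumbersome; for a fixed $\lambda'$, one finds again a
hump-shaped function with a maximum at $\lambda' \approx \lambda$. The most striking aspect of this formula,
however, is that only $g_E(z)$ (which is measurable from data) is needed to compute the
expected overlap $\Psi(\lambda, \lambda')$; the knowledge of the “true” matrix $\mathbf{C}$ is not needed to judge
whether or not the observed overlap between the eigenvectors of $\mathbf{E}$ and $\mathbf{E}'$ is compatible
with the hypothesis that such matrices are both noisy versions of the same unknown $\mathbf{C}$.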


19.1.3 Overlaps in the Multiplicative Case 


We now repeat the same steps in the case where $\mathbf{E} = \mathbf{C}^{1/2} \mathbf{W}_q \mathbf{C}^{1/2}$, where $\mathbf{W}_q$ is a Wishart
matrix of parameter $q$. We know that in this case the matrix subordination formula reads as
(13.47), which can be rewritten as

$$\mathbb{E}[\mathbf{G}_E(z)] = \frac{Z(z)}{z}\,\mathbf{G}_C(Z(z)), \qquad Z(z) = \frac{z}{1 - q + q z g_E(z)}. \tag{19.16}$$

This allows us to compute

$$\mathbf{u}_j^T \mathbf{G}_E(\lambda_i - i\eta)\,\mathbf{u}_j = \frac{Z(\lambda_i - i\eta)}{\lambda_i - i\eta}\,\frac{1}{Z(\lambda_i - i\eta) - \mu_j}, \tag{19.17}$$

and finally, taking the imaginary part in the limit $\eta \to 0^+$,

$$\Phi(\lambda, \mu) = \frac{q \mu \lambda}{\left(\mu(1 - q) - \lambda + q \mu \lambda\, \mathfrak{h}_E(\lambda)\right)^2 + q^2 \mu^2 \lambda^2 \pi^2 \rho_E(\lambda)^2}, \tag{19.18}$$

where, again, $\mathfrak{h}_E$ denotes the real part of the Stieltjes transform $g_E$. Note that in the limit
$q \to 0$, $\Phi(\lambda, \mu)$ becomes more and more peaked around $\lambda \approx \mu$, with an amplitude
that diverges for $q = 0$. Indeed, in this limiting case, one should find that the sample
eigenvectors $\mathbf{v}_i$ become equal to the population ones $\mathbf{u}_i$. More generally, $\Phi(\lambda, \mu)$ for a
fixed $\mu$ has a Lorentzian humped shape as a function of $\lambda$, which peaks for $\lambda \approx \mu$.
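A quick sanity check of Eq. (19.18) is the trivial case $\mathbf{C} = \mathbf{1}$: all population eigenvalues equal 1, the eigenvectors of $\mathbf{C}$ are arbitrary, and one expects $\Phi(\lambda, 1) = 1$ throughout the bulk. The sketch below assumes the standard Marchenko-Pastur form of $g_E$ for a white Wishart matrix; the helper names are illustrative.

```python
import numpy as np

def g_marchenko_pastur(lam, q):
    """g_E(lam - i0+) for a white Wishart matrix (C = 1), for lam inside the bulk."""
    lam_m, lam_p = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
    disc = np.sqrt(np.maximum((lam - lam_m) * (lam_p - lam), 0.0))
    return (lam + q - 1 + 1j * disc) / (2 * q * lam)

def phi_multiplicative(lam, mu, q, g):
    """Eq. (19.18), with g = g_E(lam - i0+) = h_E(lam) + i pi rho_E(lam)."""
    h, pi_rho = g.real, g.imag
    return q * mu * lam / ((mu * (1 - q) - lam + q * mu * lam * h) ** 2
                           + (q * mu * lam * pi_rho) ** 2)

q = 0.25
lam = np.linspace(0.3, 2.2, 50)            # inside the bulk [0.25, 2.25]
phi = phi_multiplicative(lam, 1.0, q, g_marchenko_pastur(lam, q))
print(phi.min(), phi.max())
```

For a non-trivial $\mathbf{C}$ the same `phi_multiplicative` can be reused, provided $g_E$ is obtained by other means (analytically or from the data).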


Now suppose that $\mathbf{E} = \mathbf{C}^{1/2} \mathbf{W}_q \mathbf{C}^{1/2}$ and $\mathbf{E}' = \mathbf{C}^{1/2} \mathbf{W}'_q \mathbf{C}^{1/2}$, where $\mathbf{W}_q$ and $\mathbf{W}'_q$ are two
independent Wishart matrices with the same parameter $q$. Using Eq. (19.8), one can again
compute the expected overlap between the eigenvectors of $\mathbf{E}$ and $\mathbf{E}'$. The final formula is
however too cumbersome to be reported here, see Bun et al. [2018]. The formula simplifies
in the limit where $\mathbf{C}$ is close to the identity matrix, in the sense that $\tau(\mathbf{C}^2) = 1 + \epsilon$, with
$\epsilon \to 0$. In this case:

$$\Psi(\lambda, \lambda') = 1 + \epsilon\left[2\lambda\,\mathfrak{h}_E(\lambda) - 1\right]\left[2\lambda'\,\mathfrak{h}_E(\lambda') - 1\right] + O(\epsilon^2). \tag{19.19}$$

More generally, the squared overlaps only depend on $g_E(z)$ (which is measurable from
data). Again, the knowledge of the “true” matrix $\mathbf{C}$ is not needed to judge whether or not
the observed overlap between the eigenvectors of $\mathbf{E}$ and $\mathbf{E}'$ is compatible with the hypothesis
that such matrices are both noisy versions of the same unknown $\mathbf{C}$. This is particularly
important in financial applications, where $\mathbf{E}$ and $\mathbf{E}'$ may correspond to covariance matrices
measured on two non-overlapping periods. In such a case, the hypothesis that the true $\mathbf{C}$ is
indeed the same in the two periods may not be warranted and can be directly tested using
the overlap formula.


19.2 Rotationally Invariant Estimators 
19.2.1 Setting the Stage 


The results derived above concerning the overlaps between the eigenvectors of sample 
E and population (or “true”) C matrices allow one to construct a rotationally invariant 
estimator of C knowing E. The idea can be framed within the Bayesian approach of the 
previous chapter, when the prior knowledge about C is mute about the possible directions 
in which the eigenvectors of $\mathbf{C}$ are pointing. More formally, this can be expressed by saying
that the prior distribution $P_0(\mathbf{C})$ is rotation invariant, i.e.

$$P_0(\mathbf{C}) = P_0(\mathbf{O}\mathbf{C}\mathbf{O}^T), \tag{19.20}$$

where $\mathbf{O}$ is an arbitrary rotation matrix. Examples of rotationally invariant priors are
provided by the orthogonal ensemble introduced in Chapter 5, where $P_0(\mathbf{C})$ only depends on
Tr(C).

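Now, since the posterior probability of $\mathbf{C}$ given $\mathbf{E}$ is given by

$$P(\mathbf{C}|\mathbf{E}) \propto (\det \mathbf{C})^{-T/2} \exp\left[-\frac{T}{2}\,\mathrm{Tr}\left(\mathbf{C}^{-1}\mathbf{E}\right)\right] P_0(\mathbf{C}), \tag{19.21}$$

it is easy to verify that the MMSE estimator of $\mathbf{C}$ transforms in the same way as $\mathbf{E}$ under an
arbitrary rotation $\mathbf{O}$, i.e.

$$\begin{aligned}
\mathbb{E}[\mathbf{C}|\mathbf{O}\mathbf{E}\mathbf{O}^T] &= \int \mathbf{C}\, P(\mathbf{C}|\mathbf{O}\mathbf{E}\mathbf{O}^T)\, \mathcal{D}\mathbf{C} \\
&= \mathbf{O}\left[\int \tilde{\mathbf{C}}\, P(\tilde{\mathbf{C}}|\mathbf{E})\, \mathcal{D}\tilde{\mathbf{C}}\right]\mathbf{O}^T \\
&= \mathbf{O}\,\mathbb{E}[\mathbf{C}|\mathbf{E}]\,\mathbf{O}^T,
\end{aligned} \tag{19.22}$$

using the change of variable $\tilde{\mathbf{C}} = \mathbf{O}^T\mathbf{C}\mathbf{O}$, and the explicit form of $P(\mathbf{C}|\mathbf{E})$ given in Eq.
(19.21).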

More generally, if we call $\Xi(\mathbf{E})$ an estimator of $\mathbf{C}$ given $\mathbf{E}$, this estimator is rotationally
invariant if and only if

$$\Xi(\mathbf{O}\mathbf{E}\mathbf{O}^T) = \mathbf{O}\,\Xi(\mathbf{E})\,\mathbf{O}^T, \tag{19.23}$$

for any orthogonal matrix $\mathbf{O}$. This means in words that, if the SCM $\mathbf{E}$ is rotated by some $\mathbf{O}$,
then our estimation of $\mathbf{C}$ must be rotated in the same fashion. Intuitively, this is because we
have no prior assumption on the eigenvectors of $\mathbf{C}$, so the only special directions in which
$\mathbf{C}$ can point are those singled out by $\mathbf{E}$ itself. Estimators abiding by Eq. (19.23) are called
rotationally invariant estimators (RIE).
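Property (19.23) is easy to check numerically for any estimator that keeps the eigenvectors of $\mathbf{E}$ and only modifies its eigenvalues. The sketch below is illustrative only: the function `rie` and the square-root shrinkage map are arbitrary choices made here, not an estimator advocated in the text.

```python
import numpy as np

def rie(e, shrink):
    """Keep the eigenvectors of e and replace each eigenvalue lam by shrink(lam)."""
    lam, v = np.linalg.eigh(e)
    return (v * shrink(lam)) @ v.T            # sum_i shrink(lam_i) v_i v_i^T

rng = np.random.default_rng(1)
n, t = 50, 200
x = rng.standard_normal((n, t))
e = x @ x.T / t                                     # a sample covariance matrix
o, _ = np.linalg.qr(rng.standard_normal((n, n)))    # a random orthogonal matrix

lhs = rie(o @ e @ o.T, np.sqrt)                     # estimator of the rotated matrix
rhs = o @ rie(e, np.sqrt) @ o.T                     # rotated estimator
print(np.allclose(lhs, rhs))
```

The check works because each projector $\mathbf{v}_i\mathbf{v}_i^T$ is insensitive to the sign ambiguity of the eigenvectors returned by the solver.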

An alternative interpretation of Eq. (19.23) is that $\Xi(\mathbf{E})$ can be diagonalized in the same
basis as $\mathbf{E}$, up to a fixed rotation matrix $\Omega$. But consistent with our rotationally invariant
prior on $\mathbf{C}$, there is no natural guess for $\Omega$, except the identity matrix $\mathbf{1}$. Hence we conclude
that $\Xi(\mathbf{E})$ has the same eigenvectors as those of $\mathbf{E}$, and write

$$\Xi(\mathbf{E}) = \sum_{i=1}^{N} \xi_i\, \mathbf{v}_i \mathbf{v}_i^T, \tag{19.24}$$

where $\mathbf{v}_i$ are, as above, the eigenvectors of $\mathbf{E}$, and where $\xi_i$ is a function of all empirical
eigenvalues $[\lambda_j]_{j\in(1,N)}$. We now show how these $\xi_i$ can be optimally chosen, and
operationally computed from data in the limit $N \to \infty$.


19.2.2 The Optimal RIE 


Suppose we ask the following question: what is the optimal choice of $\xi_i$ such that $\Xi(\mathbf{E})$,
defined by Eq. (19.24), is as close as possible to the true $\mathbf{C}$? If the eigenvectors of $\mathbf{E}$ were
equal to those of $\mathbf{C}$, i.e. if $\mathbf{v}_i = \mathbf{u}_i$, $\forall i$, the solution would trivially be $\xi_i = \mu_i$. But in
the case where $\mathbf{v}_i \neq \mathbf{u}_i$, the solution is a priori non-trivial. So we want to minimize the
following least-square error:

$$\mathrm{Tr}\left(\Xi(\mathbf{E}) - \mathbf{C}\right)^2 = \sum_{i=1}^{N} \mathbf{v}_i^T\left(\Xi(\mathbf{E}) - \mathbf{C}\right)^2 \mathbf{v}_i = \sum_{i=1}^{N}\left(\xi_i^2 - 2\xi_i\, \mathbf{v}_i^T \mathbf{C} \mathbf{v}_i + \mathbf{v}_i^T \mathbf{C}^2 \mathbf{v}_i\right). \tag{19.25}$$

Minimizing over $\xi_k$ and noting that the third term in the equation above is independent of
the $\xi$’s, it is easy to get the following expression for the optimal $\xi_k$:

$$\xi_k = \mathbf{v}_k^T \mathbf{C} \mathbf{v}_k. \tag{19.26}$$

This is all very well but seems completely absurd: we assume that we do not know the true
$\mathbf{C}$ and want to find the best estimator of $\mathbf{C}$ knowing $\mathbf{E}$, and we find an equation for the $\xi_k$
that we cannot compute unless we know $\mathbf{C}$.

Because Eq. (19.26) requires in principle knowledge we do not have, it is often called
the “oracle” estimator. But as we will see in the next section, the large $N$ limit allows one
to actually compute the optimal $\xi$’s from the data alone, without having to know $\mathbf{C}$.
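In a simulation, where $\mathbf{C}$ is known, the oracle values of Eq. (19.26) can be computed directly and compared with the naive choice $\xi_k = \lambda_k$. The following is a minimal sketch under assumed settings (a diagonal $\mathbf{C}$ with eigenvalues spread uniformly between 0.5 and 1.5, $N = 100$, $T = 200$); none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 100, 200
mu_true = np.linspace(0.5, 1.5, n)                   # assumed spectrum of C
c = np.diag(mu_true)
x = np.sqrt(mu_true)[:, None] * rng.standard_normal((n, t))
e = x @ x.T / t                                      # sample covariance matrix
lam, v = np.linalg.eigh(e)

xi_oracle = np.einsum('ji,jk,ki->i', v, c, v)        # xi_k = v_k^T C v_k, Eq. (19.26)

def frobenius_error(xi):
    """Tr (Xi - C)^2 for the estimator with eigenvalues xi and eigenvectors of E."""
    return np.sum(((v * xi) @ v.T - c) ** 2)

print(frobenius_error(lam), frobenius_error(xi_oracle))
```

Since the error (19.25) is a sum of independent parabolas in the $\xi_k$, any perturbation of `xi_oracle` can only increase `frobenius_error`.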


19.2.3 The Large Dimension Miracle 


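Let us first rewrite Eq. (19.26) in terms of the overlaps introduced in Section 19.1. Expanding
over the eigenvectors of $\mathbf{C}$ we find

$$\xi_k = \sum_{j=1}^{N} \mathbf{v}_k^T \mu_j \mathbf{u}_j \mathbf{u}_j^T \mathbf{v}_k = \sum_{j=1}^{N} \mu_j \left(\mathbf{u}_j^T \mathbf{v}_k\right)^2 \tag{19.27}$$

$$\underset{N\to\infty}{\longrightarrow} \int \mathrm{d}\mu\, \rho_C(\mu)\, \mu\, \Phi(\lambda_k, \mu), \tag{19.28}$$

where $\lambda_k$ is the eigenvalue of the sample matrix $\mathbf{E}$ associated with $\mathbf{v}_k$. In other words, $\xi_k$
is an average over the eigenvalues of $\mathbf{C}$, weighted by the square overlaps $\Phi(\lambda_k, \mu)$. Now,
using Eq. (19.5), we can also write

$$\begin{aligned}
\xi_k = \sum_{j=1}^{N} \mu_j \left(\mathbf{u}_j^T \mathbf{v}_k\right)^2 &= \frac{1}{N \pi \rho_E(\lambda_k)} \lim_{\eta\to 0^+} \mathrm{Im} \sum_{j=1}^{N} \mu_j\, \mathbf{u}_j^T \mathbf{G}_E(\lambda_k - i\eta)\, \mathbf{u}_j \\
&= \frac{1}{\pi \rho_E(\lambda_k)} \lim_{\eta\to 0^+} \mathrm{Im}\, \tau\left(\mathbf{C}\,\mathbf{G}_E(\lambda_k - i\eta)\right).
\end{aligned} \tag{19.29}$$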
TEAK) n>0+ 


We now use the fact that in both the additive and the multiplicative case, the matrices $\mathbf{G}_E$
and $\mathbf{G}_C$ are related by a subordination equation of the type

$$\mathbf{G}_E(z) = Y(z)\,\mathbf{G}_C(Z(z)), \tag{19.30}$$

with $Y(z) = 1$ in the additive case and $Y(z) = Z(z)/z$ in the multiplicative case. Hence we
can write the following series of equalities:

$$\begin{aligned}
\tau\left(\mathbf{C}\,\mathbf{G}_E(z)\right) = Y \tau\left(\mathbf{C}\,\mathbf{G}_C(Z)\right) &= Y \tau\left[(\mathbf{C} - Z\mathbf{1} + Z\mathbf{1})(Z\mathbf{1} - \mathbf{C})^{-1}\right] \\
&= Y Z g_C(Z) - Y = Z(z)\, g_E(z) - Y(z).
\end{aligned} \tag{19.31}$$

But since $Z(z)$ only depends on $g_E(z)$, we see that the final formula for $\xi_k$ does not explicitly
depend on $\mathbf{C}$ anymore and reads

$$\xi_k = \frac{1}{\pi \rho_E(\lambda_k)} \lim_{\eta\to 0^+} \mathrm{Im}\left[Z(z_k)\, g_E(z_k) - Y(z_k)\right], \qquad z_k = \lambda_k - i\eta. \tag{19.32}$$

Since all the quantities on the right hand side can be estimated from the data alone, this
formula will lend itself to real world applications. Let us first explore this formula for two
simple cases, for an additive model and for a multiplicative model.


19.2.4 The Additive Case 


For a general noise matrix $\mathbf{X}$, one has $Z(z) = z - R_X(g_E(z))$, leading to the following
mapping between the empirical eigenvalues $\lambda$ and the RIE eigenvalues $\xi$:

$$\xi(\lambda) = \lambda - \frac{\lim_{\eta\to 0^+} \mathrm{Im}\, R_X(g_E(z))\, g_E(z)}{\lim_{\eta\to 0^+} \mathrm{Im}\, g_E(z)}, \qquad z = \lambda - i\eta. \tag{19.33}$$

If there is no noise, i.e. $\mathbf{X} = 0$ and hence $R_X = 0$, we find as expected $\xi(\lambda) = \lambda$. If $\mathbf{X}$ is
small, then

$$R_X(x) = \epsilon x + \cdots, \tag{19.34}$$

where we have assumed $\tau(\mathbf{X}) = 0$ and $\epsilon = \tau(\mathbf{X}^2)$ is small. Hence we find

$$\xi(\lambda) = \lambda - 2\epsilon\, \mathfrak{h}_E(\lambda) + \cdots. \tag{19.35}$$

A natural case to consider is when $\mathbf{X}$ is Wigner noise, for which $R_X(x) = \sigma_n^2 x$ exactly,
such that the equation above is exact with $\epsilon = \sigma_n^2$, for arbitrary values of $\sigma_n$. When $\mathbf{C}$
is another Wigner matrix with variance $\sigma_s^2$, then $\mathbf{E}$ is clearly also a Wigner matrix with
variance $\sigma^2 = \sigma_s^2 + \sigma_n^2$. In this case, when $-2\sigma < \lambda < 2\sigma$,

$$\mathfrak{h}_E(\lambda) = \frac{\lambda}{2\sigma^2}. \tag{19.36}$$

Hence we obtain, from Eq. (2.38),

$$\xi(\lambda) = \lambda - \frac{\sigma_n^2}{\sigma^2}\lambda = \frac{\sigma_s^2}{\sigma^2}\lambda, \tag{19.37}$$

which is the linear shrinkage obtained for Gaussian variables in Chapter 18. In fact, this
shrinkage formula is expected elementwise, since all elements are Gaussian random
variables:¹

$$\Xi(\mathbf{E})_{ij} = \frac{\sigma_s^2}{\sigma^2}\, \mathbf{E}_{ij}, \tag{19.38}$$

see Eq. (18.10) with $x_0 = 0$.
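The linear shrinkage (19.37) can be checked against the oracle values $\xi_k = \mathbf{v}_k^T \mathbf{C} \mathbf{v}_k$ in a simulation. The sketch below assumes Gaussian (GOE-like) Wigner matrices for both signal and noise, with $\sigma_s = \sigma_n = 1$ and $N = 300$; these values and the helper `wigner` are illustrative choices.

```python
import numpy as np

def wigner(n, sigma, rng):
    """Symmetric Gaussian matrix; its spectrum tends to a semicircle of variance sigma**2."""
    a = rng.standard_normal((n, n))
    return sigma * (a + a.T) / np.sqrt(2 * n)

rng = np.random.default_rng(0)
n, sigma_s, sigma_n = 300, 1.0, 1.0
c = wigner(n, sigma_s, rng)                          # "signal" matrix
e = c + wigner(n, sigma_n, rng)                      # observed matrix E = C + X
lam, v = np.linalg.eigh(e)

xi_oracle = np.einsum('ji,jk,ki->i', v, c, v)        # v_k^T C v_k
slope = np.polyfit(lam, xi_oracle, 1)[0]             # fitted slope of xi versus lambda
r = sigma_s**2 / (sigma_s**2 + sigma_n**2)           # predicted slope, Eq. (19.37)
print(slope, r)
```

A scatter plot of `xi_oracle` against `lam` should fall on the line of slope `r`, with fluctuations that shrink as $N$ grows.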


Exercise 19.2.1 Additive RIE for the sum of two matrices from the same distribution
In this exercise we will find a simple form for a RIE estimator when the noise
is drawn from the same distribution as the signal, i.e.

$$\mathbf{E} = \mathbf{C} + \mathbf{X}, \tag{19.39}$$

with $\mathbf{X}$ and $\mathbf{C}$ mutually free matrices drawn from the same ensemble.

(a) Write a relationship between $R_X(g)$ and $R_E(g)$.

(b) Given that $g_E(z) R_E(g_E(z)) = z g_E(z) - 1$, what is $g_E(z) R_X(g_E(z))$?

(c) Use Eq. (19.33) and the fact that $z$ is real in the limit $\eta \to 0^+$ to show that
$\xi(\lambda) = \lambda/2$.

(d) Given that $\Xi = \mathbb{E}[\mathbf{C}]_{\mathbf{E}}$ (see Section 19.4), find a simple symmetry argument
to show that $\Xi = \mathbf{E}/2$.

(e) Generate numerically two independent symmetric orthogonal matrices $\mathbf{M}_1$
and $\mathbf{M}_2$ with $N = 1000$ (see Exercise 1.2.4). Compute the eigenvalues $\lambda_k$
and eigenvectors $\mathbf{v}_k$ of the sum of these two matrices.

(f) Plot the normalized histogram of the $\lambda_k$’s and compare with the arcsine law
between $-2$ and $2$ ($\rho(\lambda) = 1/(\pi\sqrt{4 - \lambda^2})$).

(g) Make a scatter plot of $\mathbf{v}_k^T \mathbf{M}_1 \mathbf{v}_k$ vs $\lambda_k$ and compare with $\lambda/2$.


19.2.5 The Multiplicative Case 


We can now tackle the multiplicative case, which includes the important practical problem
of estimating the true covariance matrix given a sample covariance matrix. In the
multiplicative case, it is more elegant to use the subordination relation Eq. (13.46) for the
T-matrix rather than for the resolvent. In the present setting we thus write

$$\mathbf{T}_E(z) = \mathbf{T}_C\left[z S_W(t_E(z))\right]. \tag{19.40}$$

¹ Note however that there is a slight subtlety here: the linear shrinkage equation (19.37) only holds in the absence of outliers, i.e.
empirical eigenvalues that fall outside the interval $(-2\sigma, 2\sigma)$. For such eigenvalues, shrinkage is non-linear. For a similar
situation in the multiplicative case, see Figure 19.2.

In terms of T-transforms, Eq. (19.29) reads

$$\xi(\lambda) = \frac{\lim_{\eta\to 0^+} \mathrm{Im}\, \tau\left(\mathbf{C}\,\mathbf{T}_C\left[z S_W(t_E(z))\right]\right)}{\lim_{\eta\to 0^+} \mathrm{Im}\, t_E(z)}, \qquad z = \lambda - i\eta. \tag{19.41}$$

Since $\mathbf{T}_C(z) = \mathbf{C}(z\mathbf{1} - \mathbf{C})^{-1}$, we have, with $t = t_E(z)$ as a shorthand,

$$\begin{aligned}
\tau\left[\mathbf{C}\,\mathbf{T}_C(z S_W(t))\right] &= \tau\left[\mathbf{C}^2 \left(z S_W(t)\mathbf{1} - \mathbf{C}\right)^{-1}\right] \\
&= \tau\left[\mathbf{C}\left(\mathbf{C} - z S_W(t)\mathbf{1} + z S_W(t)\mathbf{1}\right)\left(z S_W(t)\mathbf{1} - \mathbf{C}\right)^{-1}\right] \\
&= -\tau(\mathbf{C}) + z S_W(t)\, t_C(z S_W(t)) \\
&= -\tau(\mathbf{C}) + z S_W(t)\, t_E(z).
\end{aligned} \tag{19.42}$$

The first term $-\tau(\mathbf{C})$ is real and does not contribute to the imaginary part that we have to
compute, so we obtain

$$\xi(\lambda) = \lambda\, \frac{\lim_{\eta\to 0^+} \mathrm{Im}\, S_W(t_E(z))\, t_E(z)}{\lim_{\eta\to 0^+} \mathrm{Im}\, t_E(z)}, \qquad z = \lambda - i\eta. \tag{19.43}$$

Equation (19.43) is very general. It applies to sample covariance matrices where the noise
matrix $\mathbf{W}$ is a white Wishart, but it also applies to more general multiplicative noise
processes.

In the special case of sample covariance matrices $\mathbf{E} = \mathbf{C}^{1/2} \mathbf{W}_q \mathbf{C}^{1/2}$ with $N/T = q$, we
know that $S_{W_q}(t) = (1 + qt)^{-1}$. In the bulk region $\lambda_- < \lambda < \lambda_+$, $t = t_E(z)$ is complex
with non-zero imaginary part when $z = \lambda - i\eta$. Hence

$$\xi(\lambda) = \lambda\, \frac{\lim_{\eta\to 0^+} \mathrm{Im}\, \dfrac{t}{1 + qt}}{\lim_{\eta\to 0^+} \mathrm{Im}\, t} = \left.\frac{\lambda}{\left|1 + q\, t_E(\lambda - i\eta)\right|^2}\right|_{\eta\to 0^+}, \tag{19.44}$$

where we have used the fact that

$$\mathrm{Im}\, \frac{t}{1 + qt} = \mathrm{Im}\, \frac{t(1 + qt^*)}{|1 + qt|^2} = \mathrm{Im}\, \frac{t + q|t|^2}{|1 + qt|^2} = \frac{1}{|1 + qt|^2}\, \mathrm{Im}\, t. \tag{19.45}$$

Equation (19.44) can be interpreted as a form of non-linear shrinkage. A way to see this is
to note that below $\lambda_-$ and above $\lambda_+$ (the edges of the sample spectrum) $t_E(\lambda)$ is real. From
the very definition,

$$t_E(\lambda) = \int \mathrm{d}\lambda'\, \rho_E(\lambda')\, \frac{\lambda'}{\lambda - \lambda'} = \lambda g_E(\lambda) - 1, \tag{19.46}$$

for any $\lambda$ outside or at the edges of the spectrum. Hence, since $\lambda_- > 0$ for covariance
matrices, $t_E(\lambda_-) < 0$ and $t_E(\lambda_+) > 0$. Hence, one directly establishes that the support of
the RIE $\Xi(\mathbf{E})$ is narrower than that of $\mathbf{E}$:

$$\xi(\lambda_-) \geq \lambda_-; \qquad \xi(\lambda_+) \leq \lambda_+, \tag{19.47}$$

where the inequalities are saturated for $q = 0$, in which case, as expected, $\xi(\lambda) = \lambda$,
$\forall \lambda \in (\lambda_-, \lambda_+)$. A more in-depth discussion of the properties of Eq. (19.44) is given in
Section 19.3.

Figure 19.2 The RIE estimator (19.44) for a true covariance matrix given by an inverse-Wishart of
variance $p = 0.25$ observed using data with aspect ratio $q = 0.25$. On the support of the sample
density ($\lambda \in [0.17, 3.33]$), the RIE matches perfectly the linear shrinkage estimator (19.49) (with
$r = 1/2$), but it is different from it outside of the expected spectrum.


Using Eq. (19.46), the shrinkage equation (19.44) can be rewritten as

$$\xi(\lambda) = \left.\frac{\lambda}{\left|1 - q + q\lambda\, g_E(\lambda - i\eta)\right|^2}\right|_{\eta\to 0^+}, \tag{19.48}$$

a result first derived in Ledoit and Péché [2011].

Equation (19.44) considerably simplifies in the case where the true covariance matrix $\mathbf{C}$
is an inverse-Wishart matrix of parameter $p$. Injecting the explicit form of $t_E(z)$ given by
Eq. (15.50) into Eq. (19.44) leads, after simple manipulations, to

$$\xi(\lambda) = \frac{q}{p + q} + \frac{p}{p + q}\,\lambda, \tag{19.49}$$

i.e. exactly the linear shrinkage result derived in a Bayesian framework in Section 18.3.
Note that the result Eq. (19.49) only holds between $\lambda_-$ and $\lambda_+$, given in Eq. (15.52). The
full function $\xi(\lambda)$ when $\mathbf{C}$ is an inverse-Wishart matrix is given in Figure 19.2.
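As a minimal consistency check of Eq. (19.48), consider again $\mathbf{C} = \mathbf{1}$: the oracle value is $\mathbf{v}_k^T \mathbf{C} \mathbf{v}_k = 1$ for every $k$, so the formula should return $\xi(\lambda) = 1$ across the whole Marchenko-Pastur bulk. The sketch below assumes the standard closed form of $g_E$ for a white Wishart matrix; the function names are illustrative.

```python
import numpy as np

def g_marchenko_pastur(lam, q):
    """g_E(lam - i0+) for a white Wishart matrix (C = 1), for lam inside the bulk."""
    lam_m, lam_p = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
    disc = np.sqrt(np.maximum((lam - lam_m) * (lam_p - lam), 0.0))
    return (lam + q - 1 + 1j * disc) / (2 * q * lam)

def rie_multiplicative(lam, g, q):
    """Eq. (19.48): lam / |1 - q + q lam g_E(lam - i0+)|^2."""
    return lam / np.abs(1 - q + q * lam * g) ** 2

q = 0.5
lam = np.linspace(0.1, 2.9, 100)           # inside the bulk [(1 - sqrt(q))^2, (1 + sqrt(q))^2]
xi = rie_multiplicative(lam, g_marchenko_pastur(lam, q), q)
print(xi.min(), xi.max())
```

With real data, `g` would instead be an estimate of $g_E$ on the real axis, which is precisely the difficulty addressed in Section 19.5.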


Exercise 19.2.2 RIE when the true covariance matrix is Wishart
Assume that the true covariance matrix $\mathbf{C}$ is given by a Wishart matrix with
parameter $q_0$. This case is a tractable model for which the computation can be
done semi-analytically (we will get cubic equations!).
We observe a sample covariance matrix $\mathbf{E}$ over $T = N/q$ time intervals. $\mathbf{E}$ is
the free product of $\mathbf{C}$ and another Wishart matrix of parameter $q$:

$$\mathbf{E} = \mathbf{C}^{1/2}\, \mathbf{W}\, \mathbf{C}^{1/2}. \tag{19.50}$$

(a) Given that the S-transform of the true covariance is $S_C(t) = 1/(1 + q_0 t)$ and
the S-transform of the Wishart is $S_W(t) = 1/(1 + qt)$, use the product of
S-transforms for the free product and Eq. (11.92) to write an equation for
$t_E(z)$. It should be a cubic equation in $t$.

(b) Using a numerical polynomial solver (e.g. np.roots) solve for $t_E(z)$ for $z$ real
between 0 and 4, choose $q_0 = 1/4$ and $q = 1/2$. Choose the root with positive
imaginary part. Use Eqs. (11.89) and (2.47) to find the eigenvalue density and
plot this density. The edges of the spectrum should be (slightly below) 0.05594
and (slightly above) 3.746.

(c) For $\lambda$ in the range [0.05594, 3.746] plot the optimal cleaning function (use the
same solution $t_E(z)$ as in (b)):

$$\xi(\lambda) = \frac{\lambda}{|1 + q\, t(\lambda)|^2}. \tag{19.51}$$

(d) For $N = 1000$ numerically generate $\mathbf{C}$ ($q_0 = 1/4$), two versions of $\mathbf{W}_{1,2}$
($q = 1/2$) and hence two versions of $\mathbf{E}_{1,2} := \mathbf{C}^{1/2} \mathbf{W}_{1,2} \mathbf{C}^{1/2}$. $\mathbf{E}_1$ will be the
“in-sample” matrix and $\mathbf{E}_2$ the “out-of-sample” matrix. Check that $\tau(\mathbf{C}) =
\tau(\mathbf{W}_{1,2}) = \tau(\mathbf{E}_{1,2}) = 1$ and that $\tau(\mathbf{C}^2) = 1.25$, $\tau(\mathbf{W}_{1,2}^2) = 1.5$ and
$\tau(\mathbf{E}_{1,2}^2) = 1.75$.

(e) Plot the normalized histogram of the eigenvalues of $\mathbf{E}_1$; it should match your
plot in (b).

(f) For every eigenvalue, eigenvector pair $(\lambda_k, \mathbf{v}_k)$ of $\mathbf{E}_1$ compute $\xi_{\mathrm{val}}(\lambda_k) :=
\mathbf{v}_k^T \mathbf{E}_2 \mathbf{v}_k$. Plot $\xi_{\mathrm{val}}(\lambda_k)$ vs $\lambda_k$, and compare with your answer in (c).
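Parts (a) to (c) of the exercise can be sketched as follows. The cubic used here, $z t = (1 + t)(1 + q_0 t)(1 + q t)$, is what one obtains by assuming the convention $S_E(t) = (t + 1)/(t\,\zeta(t))$ together with $S_E = S_C S_W$; it is offered as one possible route, and the function names are illustrative.

```python
import numpy as np

def t_E(z, q0=0.25, q=0.5):
    """Root with the largest imaginary part of z t = (1 + t)(1 + q0 t)(1 + q t)."""
    coeffs = [q0 * q, q0 + q + q0 * q, 1 + q0 + q - z, 1.0]   # cubic in t, highest power first
    roots = np.roots(coeffs)
    return roots[np.argmax(roots.imag)]

def density_and_rie(lams, q0=0.25, q=0.5):
    """Eigenvalue density and cleaning function (19.51); xi is meaningful inside the bulk only."""
    t = np.array([t_E(z, q0, q) for z in lams])
    rho = t.imag / (np.pi * lams)          # t = lam g - 1, so Im t = lam * pi * rho
    xi = lams / np.abs(1 + q * t) ** 2
    return rho, xi

lams = np.linspace(0.01, 4.0, 4000)
rho, xi = density_and_rie(lams)
```

Outside the bulk all three roots are real, so `rho` vanishes there; the points where it becomes non-zero locate the edges quoted in (b).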


Exercise 19.2.3 Multiplicative RIE when the signal and the noise have the same
distribution

(a) Adapt the arguments of Exercise 19.2.1 to the multiplicative case with $\mathbf{C}$ and
$\mathbf{W}$ two free matrices drawn from the same ensemble. Show that in this case
$\xi(\lambda) = \sqrt{\lambda}$.

(b) Redo Exercise 19.2.2 with $q = q_0 = 1/4$, compare your $\xi(\lambda)$ with $\sqrt{\lambda}$.


19.2.6 RIE for Outliers 


So far, we have focused on “cleaning” the bulk eigenvectors. But it turns out that the
formulas above are also valid for outliers of $\mathbf{C}$ that appear as outliers of $\mathbf{E}$. One can show
that, outside the bulk, $g_E(z)$ and $t_E(z)$ are analytic on the real axis and thus, for small $\eta$,

$$\mathrm{Im}\, g_E(\lambda - i\eta) \approx -\eta\, g'_E(\lambda), \qquad \mathrm{Im}\, t_E(\lambda - i\eta) \approx -\eta\, t'_E(\lambda). \tag{19.52}$$

Then Eqs. (19.33) and (19.43) simplify to

$$\begin{aligned}
\xi(\lambda) &= \lambda - \frac{\mathrm{d}}{\mathrm{d}g}\left[g R_X(g)\right], \qquad g = g_E(\lambda), \\
\xi(\lambda) &= \lambda\, \frac{\mathrm{d}}{\mathrm{d}t}\left[t S_W(t)\right], \qquad t = t_E(\lambda),
\end{aligned} \tag{19.53}$$

respectively for the additive and multiplicative cases.
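As a worked illustration of Eq. (19.53), one can specialise to the two noise models used throughout this chapter, namely Wigner noise with $R_X(g) = \sigma^2 g$ and Wishart noise with $S_W(t) = (1 + qt)^{-1}$. The short derivation below only uses these two transforms:

```latex
% Additive case, Wigner noise: R_X(g) = \sigma^2 g
\frac{\mathrm{d}}{\mathrm{d}g}\left[\sigma^2 g^2\right] = 2\sigma^2 g
\quad\Longrightarrow\quad
\xi(\lambda) = \lambda - 2\sigma^2 g_E(\lambda).

% Multiplicative case, Wishart noise: S_W(t) = (1 + qt)^{-1}
\frac{\mathrm{d}}{\mathrm{d}t}\left[\frac{t}{1 + qt}\right] = \frac{1}{(1 + qt)^2}
\quad\Longrightarrow\quad
\xi(\lambda) = \frac{\lambda}{\left(1 + q\, t_E(\lambda)\right)^2}.
```

Both expressions have the same form as the bulk results (19.35) and (19.44), with the real quantities $g_E(\lambda)$ and $t_E(\lambda)$ replacing $\mathfrak{h}_E(\lambda)$ and the complex $t_E(\lambda - i\eta)$.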


19.3 Properties of the Optimal RIE for Covariance Matrices 


Even though the optimal non-linear shrinkage function (19.44), (19.48) seems relatively
simple, it is not immediately clear what is the effect induced by the transformation $\lambda_i \to
\xi(\lambda_i)$. In this section, we thus give some quantitative properties of the optimal estimator $\Xi$
to understand the impact of the optimal non-linear shrinkage function.

First let us consider the moments of the spectrum of $\Xi$. From Eqs. (19.24) and (19.26)
we immediately derive that

$$\mathrm{Tr}\, \Xi = \sum_{j=1}^{N} \mu_j\, \mathbf{u}_j^T \left(\sum_{i=1}^{N} \mathbf{v}_i \mathbf{v}_i^T\right) \mathbf{u}_j = \mathrm{Tr}\, \mathbf{C}, \tag{19.54}$$

meaning that the cleaning operation preserves the trace of the population matrix $\mathbf{C}$, as it
should do. For the second moment, we have

$$\mathrm{Tr}\, \Xi^2 = \sum_{j,k=1}^{N} \mu_j \mu_k \sum_{i=1}^{N} \left(\mathbf{v}_i^T \mathbf{u}_j\right)^2 \left(\mathbf{v}_i^T \mathbf{u}_k\right)^2.$$

Now, if we define the matrix $A_{jk}$ as $\sum_{i=1}^{N} (\mathbf{v}_i^T \mathbf{u}_j)^2 (\mathbf{v}_i^T \mathbf{u}_k)^2$ for $j, k = 1, \ldots, N$, it is not hard to
see that it is a matrix with non-negative entries and whose rows all sum to unity (remember
that all $\mathbf{v}_i$’s are normalized to unity). The matrix $\mathbf{A}$ is therefore a (bi-)stochastic matrix
and the Perron-Frobenius theorem tells us that its largest eigenvalue is equal to unity (see
Section 1.2.2). Hence, we deduce the following general inequality:

$$\sum_{j,k=1}^{N} A_{jk}\, \mu_j \mu_k \leq \sum_{j=1}^{N} \mu_j^2,$$

which implies that

$$\mathrm{Tr}\, \Xi^2 \leq \mathrm{Tr}\, \mathbf{C}^2 \leq \mathrm{Tr}\, \mathbf{E}^2, \tag{19.55}$$

where the last inequality comes from Eq. (17.11). In words, this result states that the
spectrum of $\Xi$ is narrower than the spectrum of $\mathbf{C}$, which is itself narrower than the
spectrum of $\mathbf{E}$. The optimal RIE therefore tells us that we had better be even more cautious
than simply bringing back the sample eigenvalues to their estimated true locations. This is
because we have only partial information about the true eigenbasis of $\mathbf{C}$. In particular, one
should always shrink downward (resp. upward) the small (resp. top) eigenvalues compared
to their true locations $\mu_i$ for any $i \in (1, N)$, except for the trivial case $\mathbf{C} = \mathbf{1}$.
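The trace identity (19.54) and the ordering (19.55) can be observed on a simulated sample covariance matrix, using the oracle eigenvalues $\xi_i = \mathbf{v}_i^T \mathbf{C} \mathbf{v}_i$. The settings below ($N = 200$, $T = 400$, a diagonal $\mathbf{C}$ with eigenvalues drawn uniformly in $[0.5, 1.5]$) are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 200, 400
mu_true = rng.uniform(0.5, 1.5, n)                   # assumed eigenvalues of C
c = np.diag(mu_true)
x = np.sqrt(mu_true)[:, None] * rng.standard_normal((n, t))
e = x @ x.T / t                                      # sample covariance matrix, q = 0.5
lam, v = np.linalg.eigh(e)
xi = np.einsum('ji,jk,ki->i', v, c, v)               # oracle eigenvalues of Xi

tr_xi, tr_c = xi.sum(), mu_true.sum()                # first moments, Eq. (19.54)
tr_xi2, tr_c2, tr_e2 = (xi**2).sum(), (mu_true**2).sum(), (lam**2).sum()
print(tr_xi, tr_c)
print(tr_xi2, tr_c2, tr_e2)                          # ordered as in Eq. (19.55)
```

The first inequality in (19.55) holds for every sample, while the second one is a large $N$ statement that holds here with a comfortable margin because $q$ is not small.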




Next, we consider the asymptotic behavior of the optimal non-linear shrinkage function
(19.44), (19.48). Throughout the following, suppose that we have an outlier at the left of
the lower bound of $\rho_E$ and let us assume $q < 1$ so that $\mathbf{E}$ has no exact zero mode. We know
from Section 19.2.6 that the estimator (19.44) holds for outliers. Moreover, we have that
$\lambda g_E(\lambda) = O(\lambda)$ for $\lambda \to 0$. This allows us to conclude from Eq. (19.48) that, for outliers
very close to zero,

$$\xi(\lambda) = \frac{\lambda}{(1 - q)^2} + O(\lambda^2), \tag{19.56}$$

which is in agreement with Eq. (19.55): small eigenvalues must be pushed upwards for
$q > 0$.

The other asymptotic limit $\lambda \to \infty$ is also useful since it gives us the behavior of the
non-linear shrinkage function $\xi$ for large outliers. In that case, we know from Eq. (17.8)
that $t_E(\lambda) \sim \lambda^{-1} \tau(\mathbf{E})$ for $\lambda \to \infty$. Therefore, we conclude that

$$\xi(\lambda) \approx \frac{\lambda}{\left(1 + q\lambda^{-1}\tau(\mathbf{E}) + O(\lambda^{-2})\right)^2} \approx \lambda - 2q\,\tau(\mathbf{E}) + O(\lambda^{-1}). \tag{19.57}$$

If all variances are normalized to unity such that $\tau(\mathbf{E}) = \tau(\mathbf{C}) = 1$, then we simply obtain

$$\xi(\lambda) \approx \lambda - 2q + O(\lambda^{-1}). \tag{19.58}$$

It is interesting to compare this with Eq. (14.54) for large rank-1 perturbations, which gives
$\lambda \approx \mu + q$ for $\lambda \to \infty$. As a result, we deduce from Eq. (19.58) that $\xi(\lambda) \approx \mu - q$ and we
therefore find the following ordering relation:

$$\xi(\lambda) < \mu < \lambda, \tag{19.59}$$

for isolated and large eigenvalues $\lambda$ and for $q > 0$. Again, this result is in agreement
with Eq. (19.55): large eigenvalues should be reduced downward for any $q > 0$,
even below the “true” value of the outlier $\mu$. More generally, the non-linear shrinkage
function $\xi$ interpolates smoothly between $\lambda/(1 - q)^2$ for small $\lambda$ and $\lambda - 2q$ for
large $\lambda$.


19.4 Conditional Average in Free Probability 


In this section we give an alternative derivation of the RIE formula, Eq. (19.29). This 
derivation is more elegant, albeit more abstract. In particular, it does not rely on the 
computation of eigenvector overlap, so by itself it misses the important link between the 
RIE and the computation of overlaps. 

In the context of free probability, we work with abstract objects (E, C, etc.) that satisfy 
the axioms of Chapter 11. We can think of them as infinite-dimensional matrices. We are 
given the matrix E that was obtained by free operations from an unknown matrix C. For 
instance it could be given by a combination of free product and free sum. 

The matrix E is generated from the matrix C; in this sense, E depends on C. We would 
like to find the best estimator (in the least-square sense) of C given E. It is given by the 
conditional average 


$$\Xi = \mathbb{E}[\mathbf{C}]_{\mathbf{E}}. \tag{19.60}$$




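In this abstract context, the only object we know is $\mathbf{E}$ so $\Xi$ must be a function of $\mathbf{E}$. Let us
call this function $\Xi(\mathbf{E})$. The fact that $\Xi$ is a function of $\mathbf{E}$ only imposes that $\Xi$ commutes
with $\mathbf{E}$, i.e. that $\Xi$ is diagonal in the eigenbasis of $\mathbf{E}$. One way to determine the function
$\Xi(\mathbf{E})$ is to compute all possible moments of the form $m_k = \tau[\Xi(\mathbf{E})\mathbf{E}^k]$. They can be
combined in the function

$$F(z) = \tau\left[\Xi(\mathbf{E})\,(z\mathbf{1} - \mathbf{E})^{-1}\right] \tag{19.61}$$

via its Taylor series at $z \to \infty$. Using Eq. (19.60), we write

$$F(z) = \tau\left[\mathbb{E}[\mathbf{C}]_{\mathbf{E}}\,(z\mathbf{1} - \mathbf{E})^{-1}\right]. \tag{19.62}$$

But the operator $\tau[.]$ contains the expectation value over all variables, both trace and
randomness. So by the law of total expectation, $\tau(\mathbb{E}[.]) = \tau(.)$ and

$$F(z) = \tau\left[\mathbf{C}\,(z\mathbf{1} - \mathbf{E})^{-1}\right]. \tag{19.63}$$

To recover the function $\xi(\lambda)$ from $F(z)$ we use a spectral decomposition of $\mathbf{E}$:

$$F(z) = \int \rho_E(\lambda)\, \frac{\xi(\lambda)}{z - \lambda}\, \mathrm{d}\lambda, \tag{19.64}$$

so

$$\lim_{\eta\to 0^+} \mathrm{Im}\, F(\lambda - i\eta) = \pi \rho_E(\lambda)\, \xi(\lambda), \tag{19.65}$$

which is equivalent to

$$\xi(\lambda) = \lim_{\eta\to 0^+} \frac{\mathrm{Im}\, \tau\left[\mathbf{C}\,\mathbf{G}_E(\lambda - i\eta)\right]}{\mathrm{Im}\, g_E(\lambda - i\eta)}, \tag{19.66}$$

itself equivalent to Eq. (19.29).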


19.5 Real Data 


As stated above, the good news about the RIE estimator is that it only depends on transforms
of the observable matrix $\mathbf{E}$, such as $g_E(z)$ and $t_E(z)$ and the R- or S-transform of the noise
process. One may think that real world applications should be relatively straightforward.
However, we need to know the behavior of the limiting transforms on the real axis, precisely
where the discrete $N$ transforms $g_N(z)$ and $t_N(z)$ fail to converge.

We will discuss here how to compute these transforms using either a parametric fit or 
a non-parametric approximation on the sample eigenvalues. In both cases we will tackle 
the multiplicative case with a Wishart noise but the discussion can be adapted to cover 
the additive case or any other type of noise. In Section 19.6 we will discuss an alternative 
approach using two datasets (or disjoint subsets of the original data). 


19.5.1 Parametric Approach 


Ansatz on $\rho_C$ or $S_C$

One can postulate a convenient functional form for $\rho_C(\lambda)$ and fit the associated parameters
on the data. This allows one to obtain analytical formulas for all the relevant transforms,
from which one can extract the exact behavior on the real axis.
The simplest (most tractable) choice for $\rho_C(\lambda)$ is the inverse-Wishart distribution. In
this case $\rho_E(\lambda)$ can be computed exactly (see Eq. (15.51)) and the optimal estimator is
linear within the bulk of the spectrum, cf. Eq. (19.49). When the sample covariance matrix
is normalized such that $\tau(\mathbf{E}) = 1$, the inverse-Wishart has a single parameter $p$ that
needs to be estimated from the data. As an estimate, one can use for example the second
moment of $\mathbf{E}$:

$$\tau(\mathbf{E}^2) = 1 + p + q, \tag{19.67}$$

or its first inverse moment:

$$\tau(\mathbf{E}^{-1}) = \frac{1 + p}{1 - q}, \tag{19.68}$$

which is obtained using Eq. (15.13) with $S_E(t) = (1 - pt)/(1 + qt)$, or simply by
noting that $\tau(\mathbf{W}_q^{-1} \mathbf{M}_p^{-1}) = \tau(\mathbf{W}_q^{-1})\,\tau(\mathbf{M}_p^{-1})$ for free matrices, and using the results of
Sections 15.2.2 and 15.2.3.
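The two moment estimates (19.67) and (19.68) and the resulting linear shrinkage (19.49) can be sketched as follows. The simulation helper assumes that a normalized inverse-Wishart of variance $p$ can be generated as $(1 - q_0)\mathbf{W}_{q_0}^{-1}$ with $q_0 = p/(1 + p)$, a parametrization not spelled out in this section; all function names are illustrative.

```python
import numpy as np

def tau(a):
    """Normalized trace."""
    return np.trace(a) / a.shape[0]

def estimate_p(e, q):
    """Estimates of p from Eqs. (19.67) and (19.68), after rescaling e so that tau(e) = 1."""
    e = e / tau(e)
    p_second = tau(e @ e) - 1 - q
    p_inverse = (1 - q) * tau(np.linalg.inv(e)) - 1
    return p_second, p_inverse

def linear_shrinkage(lam, p, q):
    """Eq. (19.49), valid for lam inside the bulk."""
    return (q + p * lam) / (p + q)

def sample_scm_inverse_wishart(n, p, q, rng):
    """Simulated E for an inverse-Wishart C (assumed form C = (1 - q0) W_{q0}^{-1})."""
    q0 = p / (1 + p)
    z0 = rng.standard_normal((n, int(round(n / q0))))
    c = (1 - q0) * np.linalg.inv(z0 @ z0.T / z0.shape[1])
    chol = np.linalg.cholesky(c)
    z = rng.standard_normal((n, int(round(n / q))))
    return chol @ (z @ z.T / z.shape[1]) @ chol.T     # same spectrum as C^{1/2} W_q C^{1/2}

rng = np.random.default_rng(3)
e = sample_scm_inverse_wishart(300, p=0.25, q=0.25, rng=rng)
p_second, p_inverse = estimate_p(e, q=0.25)
xi = linear_shrinkage(np.linalg.eigvalsh(e), p_second, 0.25)
print(p_second, p_inverse)
```

On real data the two estimates of $p$ need not agree, and their discrepancy gives a rough indication of how well the inverse-Wishart ansatz describes the data.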

When the distribution of sample eigenvalues appears to be bounded from above and
below, one can use a more complicated but still relatively tractable ansatz for $\rho_C(\lambda)$, by
postulating a simple form for its S-transform. For example using

$$S_C(t) = \frac{(1 - p_1 t)(1 - p_2 t)}{1 + q_1 t} \quad\Longrightarrow\quad S_E(t) = \frac{(1 - p_1 t)(1 - p_2 t)}{(1 + qt)(1 + q_1 t)}, \tag{19.69}$$

one finds that $t_E(\zeta)$ (and hence $\rho_E(\lambda)$) is the solution of a cubic equation. Higher order
terms in $t$ in the numerator or the denominator will give higher order equations for $t_E(\zeta)$.
The parameters $p_1$, $p_2$, $q_1$, etc. can be evaluated from the first few moments and inverse
moments of $\mathbf{E}$ or by fitting the observed density of eigenvalues. However, the particularly
convenient choice Eq. (19.69) does not work when the observed distribution of eigenvalues
does not have enough skewness, as in the example shown in Figure 19.3.


Parametric Fit of $\rho_E$

Another approach consists of postulating a form for the density of sample eigenvalues and
fitting its parameters. For example, one can postulate that

$$\rho_E(\lambda) = Z^{-1}\, \frac{\left(1 + a_1 \lambda + a_2 \lambda^2\right)\sqrt{(\lambda - \lambda_-)(\lambda_+ - \lambda)}}{1 + b_1 \lambda + b_2 \lambda^2}, \tag{19.70}$$

where $\lambda_\pm$ are fixed to the smallest/largest observed eigenvalues, and $a_1$, $a_2$, $b_1$ and $b_2$
are fitted on the data by minimizing the square error on the cumulative distribution. The
normalization factor $Z$ can be computed during the fitting procedure. This particular form
fits very well for sample data generated numerically (see Fig. 19.3 left). To find the optimal
shrinkage function (19.48), we then reconstruct the complex Stieltjes transform $g_E(x - i0^+)$
numerically, by using the fitted $\rho_E(\lambda)$ and computing its Hilbert transform. The issue with
such an approach is that even when Eq. (19.70) is a good fit to the sample density of
eigenvalues, it cannot be obtained as the result of the free product of a Wishart and some
given density. As a consequence the approximate estimator generated by such an ansatz is
typically non-monotonic, whereas the exact shrinkage function should be.²

Figure 19.3 Parametric fit illustrated on an example where the true covariance has a uniform density
of eigenvalues with mean 1 and variance 0.2 (see Eq. (15.42)). A single sample covariance matrix
with $N = 1000$ and $q = 0.4$ was generated, and the ad-hoc distribution (19.70) was fitted to the
eigenvalue CDF. The left-hand figure shows a histogram of the sample eigenvalues compared with
the theoretical distribution and the ad-hoc fit. The right-hand figure shows the theoretical optimal
shrinkage and the one obtained from the fit. The agreement is barely satisfactory, in particular the
shrinkage from the fit is non-monotonic. The dots show the oracle estimator $\xi_k = \mathbf{v}_k^T \mathbf{C} \mathbf{v}_k$ computed
within the same simulation.


The Case of an Unbounded Support 


On some real datasets, such as financial time series, it is hard to detect a clear boundary
between bulk eigenvalues and the large outliers. In this case one may suspect that the
distribution of eigenvalues of the true covariance matrix $\mathbf{C}$ is itself unbounded. In that case,
one may try a parametric fit for which the density of $\mathbf{C}$ extends to infinity. For example, if
we suspect that the true distribution has a sharp left edge but a power-law right tail, we may
choose to model $\rho_C(\lambda)$ as a shifted half Student’s t-distribution, i.e.

$$\rho_C(\lambda) = \Theta(\lambda - \lambda_-)\, \frac{2\,\Gamma\left(\frac{1 + \mu}{2}\right)}{\sqrt{\pi \mu a^2}\;\Gamma\left(\frac{\mu}{2}\right)} \left(1 + \frac{(\lambda - \lambda_-)^2}{\mu a^2}\right)^{-\frac{1 + \mu}{2}}, \tag{19.71}$$

where $\Theta(\lambda - \lambda_-)$ indicates that the density is non-zero only for $\lambda > \lambda_-$, chosen
to be the center of the Student’s t-distribution. These densities do not have an upper
edge, instead they fall off as $\rho(\lambda) \sim \lambda^{-\mu - 1}$ for large $\lambda$. For integer values of the tail
exponent $\mu$, the Stieltjes transform $g_C(z)$ can be computed analytically. For example for
$\mu = 3$ we find

$$g_C(z) = \frac{\sqrt{3}\pi u^3 + 6 a u^2 + 9\sqrt{3}\pi a^2 u + 36 a^3 \log\left(-u/\sqrt{3a^2}\right) + 18 a^3}{\sqrt{3}\pi \left(3a^2 + u^2\right)^2}, \tag{19.72}$$

where $u = z - \lambda_-$. Note that this Stieltjes transform has an essential singularity at $z = \lambda_-$
and a branch cut on the real axis from $\lambda_-$ to $+\infty$ indicating that the density has no upper
bound. For $\mu = 3$ both the mean and the variance of the eigenvalue density are finite. We
thus fix $\lambda_- = 1 - 2\sqrt{3a^2}/\pi$ such that $\tau(\mathbf{C}) = 1$ and adjust $a$ to obtain the desired variance
given by $\tau(\mathbf{C}^2) - 1 = 3a^2\left(1 - (2/\pi)^2\right)$.

² Although we have not been able to find a simple proof of this property, we strongly believe that it holds in full generality.

In cases like this one, where we have an analytic form for $g_C(z)$ but no simple formula
for its S-transform, we can numerically solve the subordination relation

$$t_E(\zeta) = t_C\left(\frac{\zeta}{1 + q\, t_E(\zeta)}\right), \tag{19.73}$$

with $t_C(\zeta) = \zeta g_C(\zeta) - 1$, using an efficient numerical fixed point equation solver. Most of
the time a simple iteration would find the fixed point, but for some values of $\zeta$ and $q$ it is
sometimes difficult to find an initial condition for the iteration to converge so it is better to
use a robust fixed point solver.

Let us end on a technical remark: for unbounded densities, $g(z)$ is not analytic at $z = \infty$,
which does not conform to some hypotheses made throughout the book. Intuitively, there
is no longer any clear distinction between bulk eigenvalues and outliers. For a fixed value
of $N$, and for sufficiently large $\lambda$, the distance between two successive eigenvalues will at
some point become much larger than $1/N$. Fortunately, the very same RIE formula holds
both for bulk and for outlier eigenvalues, so we can close our eyes and safely apply Eq.
(19.27) for unbounded densities as well.


19.5.2 Kernel Methods 


Another approach to compute the Stieltjes and/or the T-transform on the real axis is to
work directly with the discrete eigenvalues $\lambda_k$ of $\mathbf E$. As stated earlier we cannot simply
evaluate the discrete $g_N(z)$ at a point $z = \lambda_k$ because $g_N(z)$ is infinite precisely at the
points $z \in \{\lambda_k\}$; this is the reason why $g_N(z)$ does not converge to the limiting $g_{\mathbf E}(z)$ on the
support of $\rho_{\mathbf E}(\lambda)$.

The idea here is to generalize the standard kernel method to estimate continuous densities
from discrete data. Having observed a set of $N$ eigenvalues $[\lambda_k]_{k \in (1,N)}$, a smooth
estimator of the density is constructed as

$$\rho_s(x) := \frac{1}{N} \sum_{k=1}^{N} K_{\eta_k}(x - \lambda_k), \tag{19.74}$$

where $K_\eta$ is some adequately chosen kernel of width $\eta$ (possibly $k$-dependent), normalized
such that

$$\int_{-\infty}^{+\infty} \mathrm{d}u\, K_\eta(u) = 1, \tag{19.75}$$

such that

$$\int_{-\infty}^{+\infty} \mathrm{d}x\, \rho_s(x) = 1. \tag{19.76}$$


A standard choice for K is a Gaussian distribution, but we will discuss more appropriate 
choices for the Stieltjes transform below. 
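Now, let us similarly define a smoothed Stieltjes transform as

$$g_s(z) := \frac{1}{N} \sum_{k=1}^{N} g_{K,\eta_k}(z - \lambda_k), \tag{19.77}$$

where $g_{K,\eta}$ is the Stieltjes transform of the kernel $K_\eta$, treated as a density:

$$g_{K,\eta}(z) := \int_{-\infty}^{+\infty} \mathrm{d}u\, \frac{K_\eta(u)}{z - u}; \qquad \mathrm{Im}(z) \neq 0. \tag{19.78}$$

Note that since $\mathrm{Im}\, g_{K,\eta}(x - \mathrm{i}0^+) = \pi K_\eta(x)$, one immediately concludes that

$$\mathrm{Im}\, g_s(x - \mathrm{i}0^+) = \pi \rho_s(x) \tag{19.79}$$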


for any smoothing kernel $K_\eta$. Hence $g_s(z)$ is the natural generalization of smoothed densities
for Stieltjes transforms. Correspondingly, the real part of the smoothed Stieltjes is the
Hilbert transform (up to a $\pi$ factor) of the smoothed density, i.e.

$$h_s(x) := \mathrm{Re}\, g_s(x - \mathrm{i}0^+) = ⨍_{-\infty}^{+\infty} \mathrm{d}\lambda\, \frac{\rho_s(\lambda)}{x - \lambda}. \tag{19.80}$$


Two choices for the kernel $K_\eta$ are especially interesting. One is the Cauchy kernel:

$$K^{\mathrm C}_\eta(u) := \frac{1}{\pi} \frac{\eta}{u^2 + \eta^2}, \tag{19.81}$$

from which one gets

$$g_{K^{\mathrm C},\eta}(z) = \frac{1}{z \pm \mathrm{i}\eta}, \qquad \pm = \mathrm{sign}\left(\mathrm{Im}(z)\right). \tag{19.82}$$

Hence, in this case, we find that the smoothed Stieltjes transform we are looking for is
nothing but the discrete Stieltjes transform computed with a $k$-dependent width $\eta_k$:

$$g^{\mathrm C}_s(z) := \frac{1}{N} \sum_{k=1}^{N} \frac{1}{z - \lambda_k - \mathrm{i}\eta_k}, \qquad \mathrm{Im}(z) \leq 0, \tag{19.83}$$

which we can now safely compute numerically on the real axis, i.e. when $z = x - \mathrm{i}0^+$, and
plug in the corresponding formulas for the RIE estimator $\xi(\lambda)$.

Figure 19.4 Non-parametric kernel methods applied to the same problem as in Figure 19.3. An
approximation of $g_{\mathbf E}(\lambda - \mathrm{i}0^+)$ is computed with the Cauchy kernel (19.82) and the Wigner kernel
(19.85), both with $\eta_k = \eta = N^{-1/2}$. (left) We compare the two smoothed densities with the
theoretical one. Both are quite good but the Wigner kernel is better where the density changes rapidly.
(right) From the smoothed Stieltjes transforms we compute the shrinkage function for both methods.
Only the result of the Wigner kernel is shown (the Cauchy kernel is comparable albeit slightly
worse). Kernel methods give non-monotonic shrinkage functions which can be easily rectified using
an isotonic regression (19.86), which improves the agreement with the theoretical curve.
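
The Cauchy-kernel estimator of Eq. (19.83) is simple enough to sketch in a few lines of Python. The eigenvalues below are made-up numbers used only for illustration, and a single width $\eta = N^{-1/2}$ is assumed for all $k$.

```python
import math


def cauchy_stieltjes(x, eigenvalues, eta):
    """Smoothed Stieltjes transform g_s^C(x - i0+) = (1/N) sum_k 1/(x - lambda_k - i eta)."""
    n = len(eigenvalues)
    return sum(1.0 / complex(x - lam, -eta) for lam in eigenvalues) / n


def cauchy_density(x, eigenvalues, eta):
    """Kernel density estimate with the Cauchy kernel eta / (pi (u^2 + eta^2))."""
    n = len(eigenvalues)
    return sum(eta / (math.pi * ((x - lam) ** 2 + eta ** 2)) for lam in eigenvalues) / n


eigs = [0.4, 0.9, 1.0, 1.3, 2.1]   # hypothetical sample eigenvalues
eta = len(eigs) ** -0.5            # width eta = N^(-1/2)
g = cauchy_stieltjes(1.1, eigs, eta)
rho = cauchy_density(1.1, eigs, eta)
```

The imaginary part of `g` equals $\pi$ times `rho`, which is Eq. (19.79) specialized to the Cauchy kernel; the real part is the smoothed Hilbert transform.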


Another interesting choice for numerical applications is the semi-circle “Wigner kernel”,
which has sharp edges. To wit,

$$K^{\mathrm W}_\eta(u) = \frac{\sqrt{4\eta^2 - u^2}}{2\pi\eta^2} \quad \text{for } -2\eta < u < 2\eta, \tag{19.84}$$

and 0 when $|u| > 2\eta$. In this case, we obtain

$$g^{\mathrm W}_s(z) := \frac{1}{N} \sum_{k=1}^{N} \frac{z - \lambda_k}{2\eta_k^2} \left[1 - \sqrt{1 - \frac{4\eta_k^2}{(z - \lambda_k)^2}}\right]. \tag{19.85}$$

Figure 19.4 gives an illustration of the kernel method using both the Cauchy kernel and the
Wigner kernel, with $\eta_k = \eta = N^{-1/2}$.
We end this section with two practical implementation points regarding the kernel
methods.


1 Since the optimal RIE estimator $\xi(\lambda)$ should be monotonic in $\lambda$, one should rectify
possibly non-monotonic numerical estimators using an isotonic regression. The isotonic
regression $\hat{y}_k$ of some data $y_k$ is given by

$$\hat{y}_k = \mathrm{argmin} \sum_{k=1}^{T} \left(\hat{y}_k - y_k\right)^2 \quad \text{with} \quad \hat{y}_1 \leq \hat{y}_2 \leq \cdots \leq \hat{y}_{T-1} \leq \hat{y}_T. \tag{19.86}$$

It is the monotonic sequence that is the closest (in the least-square sense) to the original
data.
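
One standard way to compute the minimizer of Eq. (19.86) is the pool-adjacent-violators algorithm. The text does not say which algorithm was used, so the Python sketch below is only one possible implementation: it repeatedly merges neighboring blocks whose means violate the ordering.

```python
def isotonic_regression(y):
    """Least-squares non-decreasing fit to the sequence y (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)   # every point in a block gets the block mean
    return fit


print(isotonic_regression([1, 3, 2, 4]))   # [1.0, 2.5, 2.5, 4.0]
```

In practice the data should first be sorted by sample eigenvalue, so that the fitted values are monotonic in $\lambda$.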

2 In most situations, we are interested in reconstructing the optimal RIE matrix $\Xi =
\sum_k \xi(\lambda_k)\, \mathbf v_k \mathbf v_k^T$ and hence we need to evaluate the shrinkage function $\xi(\lambda)$ precisely at the
sample eigenvalues $\{\lambda_k\}$. We have found empirically that excluding the point $\lambda_k$ itself
from the kernel estimator consistently gives better results than including it. For example,
in the Cauchy case, one should compute

$$g^{\mathrm C}_s(\lambda_\ell - \mathrm{i}0^+) \approx \frac{1}{N-1} \sum_{\substack{k=1 \\ k \neq \ell}}^{N} \frac{1}{\lambda_\ell - \lambda_k - \mathrm{i}\eta_k} \tag{19.87}$$

when estimating Eq. (19.48).


19.6 Validation and RIE 


The idea of validation to determine the RIE is to compute the eigenvectors $\mathbf v_i$ of $\mathbf E$, the scm
of a training set, and to compute their unbiased variance on a different dataset: the validation
set. More formally, this is written

$$\xi_\times(\lambda_i) := \mathbf v_i^T \mathbf E' \mathbf v_i, \tag{19.88}$$


where E’ is the validation scm. The training set is also called the in-sample data and the 
validation set the out-of-sample data. 

In practical applications, we have typically a single dataset that needs to be split into 
a training and a validation set. If we are not too worried about temporal order, any block 
of the data can serve as the validation set. In K-fold cross-validation, the data is split into
K blocks, one block is the validation set and the union of the K — 1 others serves as the 
training set. The procedure is then repeated successively choosing the K possible validation 
sets, see Figure 19.5. 
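
The K-fold procedure can be sketched with NumPy as follows. This is an illustrative implementation under simple assumptions (zero-mean data stored as a $T \times N$ array, contiguous blocks); the function name and the random test data are not from the text.

```python
import numpy as np


def cross_validated_shrinkage(X, n_folds=10):
    """For each fold, return the training eigenvalues lambda_i and xi_x(lambda_i) = v_i^T E' v_i.

    X has shape (T, N): T observations of N zero-mean variables.
    """
    T, N = X.shape
    folds = np.array_split(np.arange(T), n_folds)
    lambdas, xis = [], []
    for val_idx in folds:
        train = np.delete(X, val_idx, axis=0)     # the K - 1 other blocks
        val = X[val_idx]                          # the validation block
        E_train = train.T @ train / train.shape[0]
        E_val = val.T @ val / val.shape[0]
        lam, V = np.linalg.eigh(E_train)          # columns of V are the eigenvectors v_i
        xi = np.einsum("ji,jk,ki->i", V, E_val, V)
        lambdas.append(lam)
        xis.append(xi)
    return np.array(lambdas), np.array(xis)


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))                # hypothetical data, true covariance = identity
lam, xi = cross_validated_shrinkage(X, n_folds=5)
```

Each row of `lam` and `xi` gives one set of pairs $(\lambda_i, \xi_\times(\lambda_i))$; pooling the rows and passing them through an isotonic regression yields a monotonic shrinkage function.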

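In the following, we will assume that the true covariance matrix $\mathbf C$ is the same on both
datasets so that $\mathbf E = \mathbf C^{1/2} \mathbf W \mathbf C^{1/2}$ and $\mathbf E' = \mathbf C^{1/2} \mathbf W' \mathbf C^{1/2}$, where $\mathbf W'$ is independent from $\mathbf W$.

Expanding over the eigenvectors $\mathbf v'_k$ of $\mathbf E'$, with eigenvalues $\lambda'_k$, we get

$$\xi_\times(\lambda_i) = \sum_{k=1}^{N} \left(\mathbf v_i^T \mathbf v'_k\right)^2 \lambda'_k \tag{19.89}$$

or, in the large $N$ limit and using the definition of $\Psi$ given in Eq. (19.6),

$$\xi_\times(\lambda) \underset{N \to \infty}{\longrightarrow} \int \rho_{\mathbf E'}(\lambda')\, \Psi(\lambda, \lambda')\, \lambda'\, \mathrm{d}\lambda'. \tag{19.90}$$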
=> 


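Now, there is an exact relation between $\Psi$ and $\Phi$, which reads

$$\Psi(\lambda, \lambda') = \frac{1}{N} \sum_{j=1}^{N} \Phi(\lambda, \mu_j)\, \Phi(\lambda', \mu_j) \tag{19.91}$$

or, in the continuum limit,

$$\Psi(\lambda, \lambda') = \int \rho_{\mathbf C}(\mu)\, \Phi(\lambda, \mu)\, \Phi(\lambda', \mu)\, \mathrm{d}\mu. \tag{19.92}$$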


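Intuitively, this relation can be understood as follows: we expect that, from the very
definition of $\Phi$, the eigenvectors of $\mathbf E$ and $\mathbf E'$ can be written as

$$\mathbf v_i = \frac{1}{\sqrt N} \sum_{j=1}^{N} \varepsilon_{ij} \sqrt{\Phi(\lambda_i, \mu_j)}\, \mathbf u_j; \qquad \mathbf v'_k = \frac{1}{\sqrt N} \sum_{\ell=1}^{N} \varepsilon'_{k\ell} \sqrt{\Phi(\lambda'_k, \mu_\ell)}\, \mathbf u_\ell, \tag{19.93}$$

where $\mathbf u_j$ are the eigenvectors of $\mathbf C$ and $\varepsilon_{ij}$, $\varepsilon'_{k\ell}$ are independent random variables of mean
zero and variance one, such that

$$\mathbb E[\varepsilon_{ij}\varepsilon_{k\ell}] = \delta_{ik}\delta_{j\ell}, \qquad \mathbb E[\varepsilon'_{ij}\varepsilon'_{k\ell}] = \delta_{ik}\delta_{j\ell}, \qquad \mathbb E[\varepsilon_{ij}\varepsilon'_{k\ell}] = 0. \tag{19.94}$$

This so-called ergodic assumption can be justified from considerations about the Dyson
Brownian motion of eigenvectors, see Eq. (9.10), but this goes beyond the scope of this
book. In any case, if we now compute $\mathbb E[(\mathbf v_i^T \mathbf v'_k)^2]$ using the ergodic assumption and
remembering that $\mathbf u_j^T \mathbf u_\ell = \delta_{j\ell}$, we find

$$N\, \mathbb E\left[\left(\mathbf v_i^T \mathbf v'_k\right)^2\right] = \frac{1}{N} \sum_{j=1}^{N} \Phi(\lambda_i, \mu_j)\, \Phi(\lambda'_k, \mu_j), \tag{19.95}$$

which is precisely Eq. (19.91).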
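Injecting Eq. (19.91) into Eq. (19.89), we thus find

$$\xi_\times(\lambda_i) = \frac{1}{N} \sum_{k=1}^{N} \left(\frac{1}{N} \sum_{j=1}^{N} \Phi(\lambda_i, \mu_j)\, \Phi(\lambda'_k, \mu_j)\right) \lambda'_k = \frac{1}{N^2} \sum_{j=1}^{N} \Phi(\lambda_i, \mu_j) \left(\sum_{k=1}^{N} \Phi(\lambda'_k, \mu_j)\, \lambda'_k\right). \tag{19.96}$$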

j=! k=1 


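The last term in parentheses can be computed by using the very definition of $\mathbf E'$:

$$\sum_{k=1}^{N} \Phi(\lambda'_k, \mu_j)\, \lambda'_k = N \mathbf u_j^T \mathbf E' \mathbf u_j = N \mathbf u_j^T \mathbf C^{1/2} \mathbf W' \mathbf C^{1/2} \mathbf u_j = N \mu_j\, \mathbf u_j^T \mathbf W' \mathbf u_j. \tag{19.97}$$

Now, the idea is that since $\mathbf W'$ is independent of $\mathbf C$, averaging over any small interval of $\mu_j$
will amount to replacing $\mathbf u_j^T \mathbf W' \mathbf u_j$ by its average over randomly oriented vectors $\mathbf u$, which
is equal to unity:

$$\mathbb E\left[\mathbf u^T \mathbf W' \mathbf u\right] = \tau\left(\mathbf W' \mathbf u \mathbf u^T\right) = \tau(\mathbf W')\, \tau\left(\mathbf u \mathbf u^T\right) = 1. \tag{19.98}$$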


Hence, from Eq. (19.96) we finally obtain

$$\xi_\times(\lambda_i) = \frac{1}{N} \sum_{j=1}^{N} \Phi(\lambda_i, \mu_j)\, \mu_j \;\longrightarrow\; \int \rho_{\mathbf C}(\mu)\, \Phi(\lambda, \mu)\, \mu\, \mathrm{d}\mu, \tag{19.99}$$

which precisely coincides with the definition of the optimal non-linear shrinkage function
$\xi(\lambda)$, see Eq. (19.27).

This result is very interesting and indicates that one can approximate $\xi(\lambda)$ by considering
the quadratic form between the eigenvectors of a given realization of $\mathbf C$ (say $\mathbf E$) and
another realization of $\mathbf C$ (say $\mathbf E'$) even if the two empirical matrices are characterized by
different values of the ratio $N/T$. This method is illustrated in Figure 19.5.

Figure 19.5 Shrinkage function $\xi(\lambda)$ computed for the same problem as in Figure 19.3, now using
cross-validation. The dataset is divided into K = 10 blocks of equal length. For each block, we
compute the N = 1000 eigenvalues $\lambda_i^b$ and eigenvectors $\mathbf v_i^b$ of the sample covariance matrix using
the rest of the data (of new length 9T/10), and compute $\xi_\times(\lambda_i^b) := (\mathbf v_i^b)^T \mathbf E' \mathbf v_i^b$, with $\mathbf E'$ the sample
covariance matrix of the considered block. The dots correspond to the 10 × 1000 pairs $(\lambda_i^b, \xi_\times(\lambda_i^b))$.
The full line is an isotonic regression through the dots. The procedure has a slight bias as we in fact
compute the optimal shrinkage for a value of $q$ equal to $q_\times = 10N/9T$, but otherwise the agreement
with the optimal curve is quite good.


20 Applications to Finance


20.1 Portfolio Theory 


One of the arch-problems in quantitative finance is portfolio construction. For example, one 
may consider an investment universe made of N stocks (or more generally N risky assets) 
that one should bundle up in a portfolio to achieve optimal performance, according to some 
quality measure that we will discuss below. 


20.1.1 Returns and Risk Free Rate 


We call $p_{i,t}$ the price of stock $i$ at time $t$ and define the returns over some elementary time
scale (say one day) as

$$r_{i,t} := \frac{p_{i,t} - p_{i,t-1}}{p_{i,t-1}}. \tag{20.1}$$

The portfolio weight $\pi_i$ is the dollar amount invested on asset $i$, which can be positive
(corresponding to buys) or negative (corresponding to short sales). The total capital to be
invested is $\mathcal C$. Naively, one should have $\mathcal C = \sum_i \pi_i$. But one can borrow cash, so that
$\sum_i \pi_i > \mathcal C$ and pay the risk free rate $r_0$ on the borrowed amount, or conversely under-invest
in stocks ($\sum_i \pi_i < \mathcal C$) and invest the remaining capital at the risk free rate, assumed
to be the same $r_0$.¹ Then the total return of the portfolio (in dollar terms) is

$$R_t = \sum_i \pi_i r_{i,t} + \left(\mathcal C - \sum_i \pi_i\right) r_0, \tag{20.2}$$

so that the excess return (over the risk free rate) is

$$R_t - \mathcal C r_0 := \sum_i \pi_i \left(r_{i,t} - r_0\right), \tag{20.3}$$

where $r_{i,t} - r_0$ is the excess return of asset $i$. From now on, we will denote $r_{i,t} - r_0$ by $r_{i,t}$.

We will assume that these excess returns are characterized by some expected gains $g_i$ and
a covariance matrix $\mathbf C$, with

$$C_{ij} = \mathrm{Cov}[r_i r_j]. \tag{20.4}$$

1 In general, the risk free rate to borrow is different from the one to lend, but we will neglect the difference here.


The problem, of course, is that both the vector of expected gains g and the covariance 
matrix C are unknown to the investor, who must come up with his/her best guess for these 
quantities. Forming expectations of future returns is the job of the investor, based on his/her 
information, anticipations, and hunch. We will not attempt to model the sophisticated 
process at work in the mind of investors, and simply assume that g is known. In the simplest 
case, the investor has no preferences and $\mathbf g = g \mathbf 1$, corresponding to the same expected return
for all assets. Another possibility is to assume that $\mathbf g$ is, for all practical purposes, a random
vector in $\mathbb R^N$.

As far as $\mathbf C$ is concerned, the most natural choice is to use the sample covariance matrix
$\mathbf E$, determined using a series of past returns of length $T$. However, as we already know
from Chapter 17, the eigenvalues of $\mathbf E$ can be quite far from those of $\mathbf C$ when $q = N/T$
is not very small. On the other hand, T cannot be as large as one could wish, the most 
important reason being that the (financial) world is non-stationary. For a start, many large 
firms that exist in 2019 did not exist 25 years ago. More generally, it is far from clear that 
the parameters of the underlying statistical process (if such a thing exists) can be considered 
as constant in time, so mixing different epochs is in general not warranted. On the other 
hand, due to experimental constraints, the limitation of data points can be a problem even 
in a stationary world. 


20.1.2 Portfolio Risk 
The risk of a portfolio is traditionally measured as the variance of its returns, namely

$$\mathcal R^2 := \mathbb V[R] = \sum_{i,j} \pi_i \pi_j\, \mathrm{Cov}[r_i r_j] = \boldsymbol\pi^T \mathbf C \boldsymbol\pi. \tag{20.5}$$

Other measures of risk can however be considered, such as the expected shortfall $S_p$ (or
conditional value at risk), defined as

$$S_p = -\frac{1}{p} \int_{-\infty}^{R_p} \mathrm{d}R\, R\, P(R), \tag{20.6}$$

where $R_p$ is the $p$-quantile, with for example $p = 0.01$ for the 1% negative tail events:

$$p = \int_{-\infty}^{R_p} \mathrm{d}R\, P(R). \tag{20.7}$$


If P(R) is Gaussian, then all risk measures are equivalent and subsumed in V[R]. 


20.1.3 Markowitz Optimal Portfolio Theory 


For the reader not familiar with Markowitz’s optimal portfolio theory, we recall in this 
section some of the most important results. Suppose that an investor wants to invest in a 




portfolio containing N different assets, with optimal weights x to be determined. An intu- 
itive strategy is the so-called mean-variance optimization: the investor seeks an allocation 
such that the variance of the portfolio is minimized given an expected return target. It is not 
hard to see that this mean-variance optimization can be translated into a simple quadratic 
optimization program with a linear constraint. Markowitz’s optimal portfolio amounts to 
solving the following quadratic optimization problem: 


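$$\min_{\boldsymbol\pi \in \mathbb R^N} \frac{1}{2} \boldsymbol\pi^T \mathbf C \boldsymbol\pi \quad \text{subject to} \quad \boldsymbol\pi^T \mathbf g \geq G, \tag{20.8}$$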


where G is the desired (or should we say hoped for) gain. Without further constraints — such 
as the positivity of all weights necessary if short positions are not allowed — this problem 
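can be easily solved by introducing a Lagrangian multiplier $\gamma$ and writing

$$\min_{\boldsymbol\pi \in \mathbb R^N} \frac{1}{2} \boldsymbol\pi^T \mathbf C \boldsymbol\pi - \gamma\, \boldsymbol\pi^T \mathbf g. \tag{20.9}$$

Assuming that $\mathbf C$ is invertible, it is not hard to find the optimal solution and the value of $\gamma$
such that overall expected return is exactly $G$. It is given by

$$\boldsymbol\pi_{\mathbf C} = G\, \frac{\mathbf C^{-1} \mathbf g}{\mathbf g^T \mathbf C^{-1} \mathbf g}, \tag{20.10}$$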


which, as noted above, requires the knowledge of both C and g, which are a priori 
unknown. Note that even if the predictions g; of our investor are completely wrong, it still 
makes sense to look for the minimum risk portfolio consistent with his/her expectations. 
But we are left with the problem of estimating $\mathbf C$, or maybe $\mathbf C^{-1}$, before applying
Markowitz’s formula, Eq. (20.10). We will see below why one should actually find the 
best estimator of C itself before inverting it and determining the weights. 
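
A minimal NumPy sketch of the Markowitz formula, Eq. (20.10), i.e. $\boldsymbol\pi = G\, \mathbf C^{-1} \mathbf g / (\mathbf g^T \mathbf C^{-1} \mathbf g)$, is given below. The two-asset covariance matrix, expected gains and target are made-up numbers chosen only to illustrate the calculation.

```python
import numpy as np


def markowitz_weights(C, g, G):
    """Minimum-variance weights with expected gain G: pi = G C^{-1} g / (g^T C^{-1} g)."""
    Cinv_g = np.linalg.solve(C, g)      # solve C x = g instead of forming the inverse
    return G * Cinv_g / (g @ Cinv_g)


C = np.array([[0.04, 0.01],
              [0.01, 0.09]])            # hypothetical covariance matrix
g = np.array([0.05, 0.08])              # hypothetical expected gains
G = 0.06                                # target gain
pi = markowitz_weights(C, g, G)
risk = pi @ C @ pi                      # variance of the portfolio return
```

By construction `g @ pi` equals `G`, and `risk` equals $G^2 / (\mathbf g^T \mathbf C^{-1} \mathbf g)$. The same function can be called with any estimator of $\mathbf C$ in place of the true matrix.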

What is the risk associated with this optimal allocation strategy, measured as the variance 
of the returns of the portfolio? If one knew the population correlation matrix C, the true 
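optimal risk associated with $\boldsymbol\pi_{\mathbf C}$ would be given by

$$\mathcal R^2_{\mathrm{true}} = \boldsymbol\pi_{\mathbf C}^T \mathbf C \boldsymbol\pi_{\mathbf C} = \frac{G^2}{\mathbf g^T \mathbf C^{-1} \mathbf g}. \tag{20.11}$$

However, the optimal strategy (20.10) is not attainable in practice as the matrix $\mathbf C$ is
unknown. What can one do then, and how poorly is the realized risk of the portfolio
estimated?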


20.1.4 Predicted and Realized Risk 


One obvious, but far too naive, way to use the Markowitz optimal portfolio is to apply
(20.10) using the scm $\mathbf E$ as is, instead of $\mathbf C$. Recalling the results of Chapter 17, it is not
hard to see that this strategy will suffer from strong biases whenever $T$ is not sufficiently
large compared to $N$.
Notwithstanding, the optimal investment weights using the scm $\mathbf E$ read

$$\boldsymbol\pi_{\mathbf E} = G\, \frac{\mathbf E^{-1} \mathbf g}{\mathbf g^T \mathbf E^{-1} \mathbf g}, \tag{20.12}$$

and the minimum risk associated with this portfolio is thus given by

$$\mathcal R^2_{\mathrm{in}} = \boldsymbol\pi_{\mathbf E}^T \mathbf E \boldsymbol\pi_{\mathbf E} = \frac{G^2}{\mathbf g^T \mathbf E^{-1} \mathbf g}, \tag{20.13}$$

which is known as the “in-sample” risk, or the predicted risk. It is “in-sample” because it
is entirely constructed using the available data. The realized risk in the next period, with
fresh data, is correspondingly called out-of-sample.

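Using the convexity with respect to $\mathbf E$ of $\mathbf g^T \mathbf E^{-1} \mathbf g$, we find from the Jensen inequality
that, for fixed predicted gains $\mathbf g$,

$$\mathbb E\left[\mathbf g^T \mathbf E^{-1} \mathbf g\right] \geq \mathbf g^T\, \mathbb E[\mathbf E]^{-1} \mathbf g = \mathbf g^T \mathbf C^{-1} \mathbf g, \tag{20.14}$$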


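where the last equality holds because $\mathbf E$ is an unbiased estimator of $\mathbf C$. Hence, we conclude
that the in-sample risk is lower than the “true” risk and therefore the optimal portfolio
$\boldsymbol\pi_{\mathbf E}$ suffers from an in-sample bias: its predicted risk underestimates the true optimal risk
$\mathcal R_{\mathrm{true}}$. Intuitively this comes from the fact that $\boldsymbol\pi_{\mathbf E}$ attempts to exploit all the idiosyncrasies
that happened during the in-sample period, and therefore manages to reduce the risk below
the true optimal risk. But the situation is even worse, because the future out-of-sample or
realized risk turns out to be larger than the true risk. Indeed, let us denote by $\mathbf E'$ the scm of
this out-of-sample period; the out-of-sample risk is then naturally defined by

$$\mathcal R^2_{\mathrm{out}} = \boldsymbol\pi_{\mathbf E}^T \mathbf E' \boldsymbol\pi_{\mathbf E} = \frac{G^2\, \mathbf g^T \mathbf E^{-1} \mathbf E' \mathbf E^{-1} \mathbf g}{\left(\mathbf g^T \mathbf E^{-1} \mathbf g\right)^2}. \tag{20.15}$$

For large matrices, we expect the result to be self-averaging and given by its expectation
value (over the measurement noise). But if the measurement noise in the in-sample period
(contained in $\boldsymbol\pi_{\mathbf E}$) can be assumed to be independent from that of the out-of-sample
period, then $\boldsymbol\pi_{\mathbf E}$ and $\mathbf E'$ are uncorrelated and we get, for $N \to \infty$,

$$\boldsymbol\pi_{\mathbf E}^T \mathbf E' \boldsymbol\pi_{\mathbf E} = \boldsymbol\pi_{\mathbf E}^T \mathbf C \boldsymbol\pi_{\mathbf E}. \tag{20.16}$$

Now, from the optimality of $\boldsymbol\pi_{\mathbf C}$, we also know that

$$\boldsymbol\pi_{\mathbf C}^T \mathbf C \boldsymbol\pi_{\mathbf C} \leq \boldsymbol\pi_{\mathbf E}^T \mathbf C \boldsymbol\pi_{\mathbf E}, \tag{20.17}$$

so we readily obtain the following general inequalities:

$$\mathcal R^2_{\mathrm{in}} \leq \mathcal R^2_{\mathrm{true}} \leq \mathcal R^2_{\mathrm{out}}. \tag{20.18}$$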


We plot in Figure 20.1 an illustration of these inequalities. One can see how using
$\boldsymbol\pi_{\mathbf E}$ is clearly overoptimistic and can potentially lead to disastrous results in practice.

Figure 20.1 Efficient frontier associated with the mean-variance optimal portfolio (20.10) for $\mathbf g = \mathbf 1$
and $\mathbf C$ an inverse-Wishart matrix with $p = 0.5$, for $q = 0.5$. The black line depicts the expected
gain as a function of the true optimal risk (20.11). The gray lines correspond to the realized (out-of-sample)
risk using either the scm $\mathbf E$ or its RIE version $\Xi$. Both estimates are above the true risk, but
less so for RIE. Finally, the dashed lines represent the predicted (in-sample) risk, again using either
the scm $\mathbf E$ or its RIE version $\Xi$. $\mathcal R$ and $G$ in arbitrary units, such that $\mathcal R_{\mathrm{true}} = 1$ for $G = 1$.


This conclusion in fact holds for different risk measures, such as the expected shortfall 
measure mentioned in Section 20.1.2. 


20.2 The High-Dimensional Limit 
20.2.1 In-Sample vs Out-of-Sample Risk: Exact Results 


In the limit of large matrices and with some assumptions on the structure g, we can make 
the general inequalities Eq. (20.18) more precise using the random matrix theory tools from 
the previous chapters. Let us suppose that the vector of predictors g points in a random 
direction, in the sense that the covariance matrix $\mathbf C$ and $\mathbf g \mathbf g^T$ are mutually free. This is not
necessarily a natural assumption. For example, the simplest “agnostic” prediction g = 1 is 
often nearly collinear with the top eigenvector of C (see Section 20.4). So we rather think 
here of market neutral, sector neutral predictors that attempt to capture very idiosyncratic 
characteristics of firms. 
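Now, if $\mathbf M$ is a positive definite matrix that is free from $\mathbf g \mathbf g^T$, then in the large $N$ limit:

$$\frac{\mathbf g^T \mathbf M \mathbf g}{N} = \frac{1}{N} \mathrm{Tr}\left[\mathbf g \mathbf g^T \mathbf M\right] \underset{\text{freeness}}{=} \frac{\|\mathbf g\|^2}{N}\, \tau(\mathbf M), \tag{20.19}$$

where we recall that $\tau$ is the normalized trace operator. We can always normalize the
prediction vector such that $\|\mathbf g\|^2/N = 1$, so setting $\mathbf M = \{\mathbf E^{-1}, \mathbf C^{-1}\}$, we can directly estimate
Eqs. (20.13), (20.11) and (20.15) and find

$$\mathcal R^2_{\mathrm{in}} \to \frac{G^2}{N \tau(\mathbf E^{-1})}, \qquad \mathcal R^2_{\mathrm{true}} \to \frac{G^2}{N \tau(\mathbf C^{-1})}, \qquad \mathcal R^2_{\mathrm{out}} \to \frac{G^2\, \tau(\mathbf E^{-1} \mathbf C \mathbf E^{-1})}{N \tau^2(\mathbf E^{-1})}. \tag{20.20}$$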

Let us focus on the first two terms above. For $q < 1$, we know from Eq. (17.14) that, in
the high-dimensional limit, $\tau(\mathbf C^{-1}) = (1 - q)\, \tau(\mathbf E^{-1})$. As a result, we have, for $N \to \infty$,

$$\mathcal R^2_{\mathrm{in}} = (1 - q)\, \mathcal R^2_{\mathrm{true}}. \tag{20.21}$$

Hence, for any $q \in (0,1)$, we see that the in-sample risk associated with $\boldsymbol\pi_{\mathbf E}$ always
provides an overoptimistic estimator. Even better, we are able to quantify precisely the
risk underestimation factor thanks to Eq. (20.21).

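Next we would like to find the same type of relation for the “out-of-sample” risk. In
order to do so, we write $\mathbf E = \mathbf C^{1/2} \mathbf W_q \mathbf C^{1/2}$ where $\mathbf W_q$ is a white Wishart matrix of parameter
$q$, independent from $\mathbf C$. Plugging this representation into Eq. (20.15), we find that the out-of-sample
risk can be expressed as

$$\mathcal R^2_{\mathrm{out}} = \frac{G^2\, \tau\left(\mathbf C^{-1} \mathbf W_q^{-2}\right)}{N \tau^2(\mathbf E^{-1})}$$

when $N \to \infty$. Now, since $\mathbf W_q$ and $\mathbf C$ are asymptotically free, we also have

$$\tau\left(\mathbf C^{-1} \mathbf W_q^{-2}\right) = \tau\left(\mathbf C^{-1}\right) \tau\left(\mathbf W_q^{-2}\right). \tag{20.22}$$

Hence, using again Eq. (17.14) yields

$$\mathcal R^2_{\mathrm{out}} = G^2 (1 - q)^2\, \frac{\tau\left(\mathbf W_q^{-2}\right)}{N \tau(\mathbf C^{-1})}. \tag{20.23}$$

Finally, we know from Eq. (15.22) that $\tau(\mathbf W_q^{-2}) = (1 - q)^{-3}$ for $q < 1$. Hence, we finally
get

$$\mathcal R^2_{\mathrm{out}} = \frac{\mathcal R^2_{\mathrm{true}}}{1 - q}. \tag{20.24}$$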


All in all, we have obtained the following asymptotic relation:

$$\frac{\mathcal R^2_{\mathrm{in}}}{1 - q} = \mathcal R^2_{\mathrm{true}} = (1 - q)\, \mathcal R^2_{\mathrm{out}}, \tag{20.25}$$

which holds for a completely general $\mathbf C$.
Hence, if one invests with the “naive” weights $\boldsymbol\pi_{\mathbf E}$, it turns out that the predicted risk $\mathcal R_{\mathrm{in}}$
underestimates the realized risk $\mathcal R_{\mathrm{out}}$ by a factor $(1 - q)$, and in the extreme case $N = T$
or $q = 1$, the in-sample risk is equal to zero while the out-of-sample risk diverges (for
$N \to \infty$). We thus conclude that, as announced, the use of the scm $\mathbf E$ for the Markowitz
optimization problem can lead to disastrous results. This suggests that we should use a
more reliable estimator of $\mathbf C$ in order to control the out-of-sample risk.
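
The relation $\mathcal R^2_{\mathrm{in}}/(1-q) = \mathcal R^2_{\mathrm{true}} = (1-q)\,\mathcal R^2_{\mathrm{out}}$ can be checked numerically. The NumPy sketch below uses the simplest possible setting, which is an assumption of this example rather than something prescribed by the text: $\mathbf C = \mathbf 1$, Gaussian returns, $\mathbf g = \mathbf 1$ and $q = 1/2$. The out-of-sample risk is evaluated with the true $\mathbf C$, following Eq. (20.16).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, G = 300, 600, 1.0
q = N / T
C = np.eye(N)                        # true covariance (simplest choice)
g = np.ones(N)                       # predictor normalized so that g.g / N = 1

X = rng.standard_normal((T, N))      # in-sample returns with covariance C
E = X.T @ X / T                      # sample covariance matrix


def weights(M):
    """Markowitz weights built from the matrix M."""
    Minv_g = np.linalg.solve(M, g)
    return G * Minv_g / (g @ Minv_g)


pi_C, pi_E = weights(C), weights(E)
risk_true = pi_C @ C @ pi_C          # true optimal risk
risk_in = pi_E @ E @ pi_E            # predicted (in-sample) risk
risk_out = pi_E @ C @ pi_E           # expected realized (out-of-sample) risk

print(risk_in / risk_true, 1 - q)            # both close to 0.5
print(risk_out / risk_true, 1 / (1 - q))     # both close to 2
```

At finite $N$ the ratios fluctuate around their asymptotic values by roughly ten percent for these sizes; increasing $N$ and $T$ at fixed $q$ tightens the agreement.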


20.2.2 Out-of-Sample Risk Minimization 


We insisted throughout the last section that the right quantity to control in portfolio man- 
agement is the realized, out-of-sample risk. It is also clear from Eq. (20.25) that using the 
sample estimate E is a very bad idea, and hence it is natural to wonder which estimator 
of C one should use to minimize this out-of-sample risk? The Markowitz formula (20.10) 
naively suggests that one should look for a faithful estimator of the so-called precision
matrix $\mathbf C^{-1}$. But in fact, since the expected out-of-sample risk involves the matrix $\mathbf C$ linearly,
it is that matrix that should be estimated.

Let us show this using another route, in the context of rotationally invariant estimators, 
which we considered in Chapter 19. Let us define our RIE as

$$\Xi = \sum_{i=1}^{N} \xi(\lambda_i)\, \mathbf v_i \mathbf v_i^T, \tag{20.26}$$

where we recall that $\mathbf v_i$ are the sample eigenvectors of $\mathbf E$ and $\xi(\cdot)$ is a function that has to
be determined using some optimality criterion.

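Suppose that we construct a Markowitz optimal portfolio $\boldsymbol\pi_\Xi$ using this RIE. Again, we
assume that the vector $\mathbf g$ is random, and independent from $\Xi$. Consequently, the estimate
(20.19) is still valid, such that the realized risk associated with the portfolio $\boldsymbol\pi_\Xi$ reads, for
$N \to \infty$,

$$\mathcal R^2_{\mathrm{out}}(\Xi) = G^2\, \frac{\mathrm{Tr}\left(\Xi^{-1} \mathbf C\, \Xi^{-1}\right)}{\left(\mathrm{Tr}\, \Xi^{-1}\right)^2}. \tag{20.27}$$

Using the decomposition (20.26) of $\Xi$, we can rewrite the numerator as

$$\mathrm{Tr}\left(\Xi^{-1} \mathbf C\, \Xi^{-1}\right) = \sum_{i=1}^{N} \frac{\mathbf v_i^T \mathbf C \mathbf v_i}{\xi^2(\lambda_i)}, \tag{20.28}$$

while the denominator of Eq. (20.27) is

$$\left(\mathrm{Tr}\, \Xi^{-1}\right)^2 = \left(\sum_{i=1}^{N} \frac{1}{\xi(\lambda_i)}\right)^2. \tag{20.29}$$

Regrouping these last two equations allows us to express Eq. (20.27) as

$$\mathcal R^2_{\mathrm{out}}(\Xi) = G^2 \sum_{i=1}^{N} \frac{\mathbf v_i^T \mathbf C \mathbf v_i}{\xi^2(\lambda_i)} \left(\sum_{i=1}^{N} \frac{1}{\xi(\lambda_i)}\right)^{-2}. \tag{20.30}$$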


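Our aim is to find the optimal shrinkage function $\xi(\lambda_j)$ associated with the sample eigenvalues
$[\lambda_j]_{j=1}^{N}$, such that the out-of-sample risk is minimized. This can be done by solving,
for a given $j$, the following first order condition:

$$\frac{\partial \mathcal R^2_{\mathrm{out}}(\Xi)}{\partial \xi(\lambda_j)} = 0, \qquad \forall j = 1, \ldots, N. \tag{20.31}$$

By performing the derivative with respect to $\xi(\lambda_j)$ in (20.30), one obtains

$$-2\, \frac{\mathbf v_j^T \mathbf C \mathbf v_j}{\xi^3(\lambda_j)} \left(\sum_{i=1}^{N} \frac{1}{\xi(\lambda_i)}\right)^{-2} + \frac{2}{\xi^2(\lambda_j)} \sum_{i=1}^{N} \frac{\mathbf v_i^T \mathbf C \mathbf v_i}{\xi^2(\lambda_i)} \left(\sum_{i=1}^{N} \frac{1}{\xi(\lambda_i)}\right)^{-3} = 0. \tag{20.32}$$

The solution to this equation is given by

$$\xi(\lambda_j) = A\, \mathbf v_j^T \mathbf C \mathbf v_j, \tag{20.33}$$

where $A$ is an arbitrary constant at this stage. But since the trace of the RIE must match that
of $\mathbf C$, this constant $A$ must be equal to 1. Hence we recover precisely the oracle estimator
that we have studied in Chapter 19.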
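
The oracle estimator $\xi(\lambda_i) = \mathbf v_i^T \mathbf C \mathbf v_i$ is easy to build numerically when the true $\mathbf C$ is known, which is of course only the case in simulations. The NumPy sketch below uses a made-up diagonal $\mathbf C$ and Gaussian data purely for illustration.

```python
import numpy as np


def oracle_rie(E, C):
    """Keep the eigenvectors of E and replace its eigenvalues by xi_i = v_i^T C v_i."""
    lam, V = np.linalg.eigh(E)                   # columns of V are the eigenvectors v_i
    xi = np.einsum("ji,jk,ki->i", V, C, V)       # xi_i = v_i^T C v_i
    Xi = (V * xi) @ V.T                          # sum_i xi_i v_i v_i^T
    return Xi, lam, xi


rng = np.random.default_rng(1)
N, T = 50, 100
c_diag = np.linspace(0.5, 1.5, N)                # hypothetical true eigenvalues
C = np.diag(c_diag)
X = rng.standard_normal((T, N)) * np.sqrt(c_diag)
E = X.T @ X / T
Xi, lam, xi = oracle_rie(E, C)
```

The trace of `Xi` equals the trace of `C` exactly, consistent with the choice $A = 1$ above, and the oracle eigenvalues `xi` are less spread out than the sample eigenvalues `lam`.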

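As a conclusion, the optimal RIE (19.26) actually minimizes the out-of-sample risk
within the class of rotationally invariant estimators. Moreover, the corresponding “optimal”
realized risk is given by

$$\mathcal R^2_{\mathrm{out}}(\Xi) = \frac{G^2}{\mathrm{Tr}\left[(\Xi)^{-1}\right]}, \tag{20.34}$$

where we used the notable property that, for any $n \in \mathbb Z$,

$$\mathrm{Tr}\left[(\Xi)^n \mathbf C\right] = \sum_{i=1}^{N} \xi(\lambda_i)^n\, \mathrm{Tr}\left[\mathbf v_i \mathbf v_i^T \mathbf C\right] = \sum_{i=1}^{N} \xi(\lambda_i)^n\, \mathbf v_i^T \mathbf C \mathbf v_i = \mathrm{Tr}\left[(\Xi)^{n+1}\right]. \tag{20.35}$$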

20.2.3 The Inverse-Wishart Model: Explicit Results


In this section, we specialize the result (20.34) to the case when $\mathbf C$ is an inverse-Wishart
matrix with parameter $p > 0$, corresponding to the simple linear shrinkage optimal
estimator. First, we read from Eq. (15.30) that

$$\tau\left(\mathbf C^{-1}\right) = -g_{\mathbf C}(0) = 1 + p, \tag{20.36}$$

so that we get from Eq. (20.20) that, in the large $N$ limit,

$$\mathcal R^2_{\mathrm{true}} = \frac{G^2}{N}\, \frac{1}{1 + p}. \tag{20.37}$$

Next, we see from Eq. (20.34) that the optimal out-of-sample risk requires the computation
of $\tau((\Xi)^{-1})$. In general, the computation of this quantity is highly non-trivial but
some simplifications appear when $\mathbf C$ is an inverse-Wishart matrix. In the large-dimension
limit, the final result reads

$$\tau\left((\Xi)^{-1}\right) = -\left(1 + \frac{q}{p}\right) g_{\mathbf E}\left(-\frac{q}{p}\right) = 1 + \frac{p^2}{p + q + pq}, \tag{20.38}$$

and therefore we have from Eq. (20.34)

$$\mathcal R^2_{\mathrm{out}}(\Xi) = \frac{G^2}{N}\, \frac{p + q + pq}{(p + q)(1 + p)}, \tag{20.39}$$

and so it is clear from Eqs. (20.39) and (20.37) that, for any $p > 0$,

$$\frac{\mathcal R^2_{\mathrm{out}}(\Xi)}{\mathcal R^2_{\mathrm{true}}} = 1 + \frac{pq}{p + q} \geq 1, \tag{20.40}$$

where the last inequality becomes an equality only when $q = 0$, as it should.
It is also interesting to evaluate the in-sample risk associated with the optimal RIE. It is
defined by

$$\mathcal R^2_{\mathrm{in}}(\Xi) = G^2\, \frac{\mathrm{Tr}\left[(\Xi)^{-1} \mathbf E\, (\Xi)^{-1}\right]}{\left(\mathrm{Tr}\left[(\Xi)^{-1}\right]\right)^2}, \tag{20.41}$$

where the most challenging term is the numerator. Using the fact that the eigenvalues of
$\Xi$ are given by the linear shrinkage formula (19.49), one can once again find a closed
formula.² The final result is written

$$\mathcal R^2_{\mathrm{in}}(\Xi) = \frac{G^2}{N}\, \frac{p + q}{(1 + p)\left(p + q(p + 1)\right)}, \tag{20.42}$$

and we therefore deduce with Eq. (20.37) that, for any $p > 0$,

$$\frac{\mathcal R^2_{\mathrm{in}}(\Xi)}{\mathcal R^2_{\mathrm{true}}} = 1 - \frac{pq}{p + q(1 + p)} \leq 1, \tag{20.43}$$

where the inequality becomes an equality for $q = 0$ as above.
Finally, one may easily check from Eqs. (20.25), (20.40) and (20.43) that

$$\mathcal R^2_{\mathrm{in}}(\Xi) - \mathcal R^2_{\mathrm{in}}(\mathbf E) \geq 0, \qquad \mathcal R^2_{\mathrm{out}}(\Xi) - \mathcal R^2_{\mathrm{out}}(\mathbf E) \leq 0, \tag{20.44}$$

showing explicitly that we indeed reduce the overfitting by using the oracle estimator
instead of the scm in the high-dimensional framework: both the in-sample and out-of-sample
risks computed using $\Xi$ are closer to the true risk than when computed with the raw
empirical matrix $\mathbf E$. The results shown in Figure 20.1 correspond to the inverse-Wishart
case with $p = q = 1/2$.


Exercise 20.2.1 Optimal portfolio when the true covariance matrix is Wishart 
In this exercise we continue the analysis of Exercise 19.2.2 assuming that 
we measure an scm from data with a true covariance given by a Wishart with 
parameter qo. 


2 Details of this computation can be found in Bun et al. [2017].

(a) The minimum risk portfolio with expected gain $G$ is given by

$$\boldsymbol\pi_{\mathbf C} = G\, \frac{\mathbf C^{-1} \mathbf g}{\mathbf g^T \mathbf C^{-1} \mathbf g}, \tag{20.45}$$

where $\mathbf C$ is the covariance matrix (or an estimator of it) and $\mathbf g$ is the vector of
expected gains. Compute the matrix $\Xi$ by taking the matrix $\mathbf E_1$ and replacing
its eigenvalues $\lambda_k$ by $\xi(\lambda_k)$ and keeping the same eigenvectors. Use the result
of Exercise 19.2.2(c) for $\xi(\lambda)$; if some $\lambda_k$ are below 0.05594 or above 3.746
replace them by 0.05594 and 3.746 respectively. This is so that you do not
have to worry about finding the correct solution $t_{\mathbf E}(z)$ for $z$ outside of the
bulk.

(b) Build the three portfolios $\boldsymbol\pi_{\mathbf C}$, $\boldsymbol\pi_{\mathbf E}$ and $\boldsymbol\pi_\Xi$ by computing Eq. (20.45) for the
three matrices $\mathbf C$, $\mathbf E_1$ and $\Xi$ using $G = 1$ and $\mathbf g = \mathbf e_1$, the vector with 1 in the
first component and 0 everywhere else. These three portfolios correspond to
the true optimal, the naive optimal and the cleaned optimal. The true optimal
is in general unobtainable. For these three portfolios compute the in-sample
risk $\mathcal R^2_{\mathrm{in}} := \boldsymbol\pi^T \mathbf E_1 \boldsymbol\pi$, the true risk $\mathcal R^2_{\mathrm{true}} := \boldsymbol\pi^T \mathbf C \boldsymbol\pi$ and the out-of-sample risk
$\mathcal R^2_{\mathrm{out}} := \boldsymbol\pi^T \mathbf E_2 \boldsymbol\pi$.

(c) Comment on these nine values. For $\boldsymbol\pi_{\mathbf C}$ and $\boldsymbol\pi_{\mathbf E}$ you should find exact
theoretical values. The out-of-sample risk for $\boldsymbol\pi_\Xi$ should be better than for $\boldsymbol\pi_{\mathbf E}$
but worse than for $\boldsymbol\pi_{\mathbf C}$.


20.3 The Statistics of Price Changes: A Short Overview 
20.3.1 Bachelier’s First Law 


The simplest property of financial prices, dating back to Bachelier’s thesis, states that typical
price variations grow like the square-root of time. More formally, under the assumption
that price returns have zero mean (which is usually a good approximation on short time
scales), then the price variogram

$$\mathcal V(\tau) := \mathbb E\left[\left(\log p_{t+\tau} - \log p_t\right)^2\right] \tag{20.46}$$

grows linearly with time lag $\tau$, such that $\mathcal V(\tau) = \sigma^2 \tau$.


20.3.2 Signature Plots 


Assume now that a price series is described by

$$\log p_t = \log p_0 + \sum_{t'=1}^{t} r_{t'}, \tag{20.47}$$

where the return series $r_t$ is covariance-stationary with zero mean and covariance

$$\mathrm{Cov}\left(r_{t'}, r_{t''}\right) = \sigma^2 C_r\left(|t' - t''|\right). \tag{20.48}$$

The case of a random walk with uncorrelated price returns corresponds to $C_r(u) = \delta_{u,0}$,
where $\delta_{u,0}$ is the Kronecker delta function. A trending random walk has $C_r(u) > 0$ and a
mean-reverting random walk has $C_r(u) < 0$. How does this affect Bachelier’s first law?

One important implication is that the volatility observed by sampling price series on a
given time scale $\tau$ is itself dependent on that time scale. More precisely, the volatility at
scale $\tau$ is given by

$$\sigma^2(\tau) := \frac{\mathcal V(\tau)}{\tau} = \sigma^2(1) \left[1 + 2 \sum_{u=1}^{\tau} \left(1 - \frac{u}{\tau}\right) C_r(u)\right]. \tag{20.49}$$

A plot of $\sigma(\tau)$ versus $\tau$ is called a volatility signature plot. The case of an uncorrelated random
walk leads to a flat signature plot. Positive correlations (which correspond to trends)
lead to an increase in $\sigma(\tau)$ with increasing $\tau$. Negative correlations (which correspond to
mean reversion) lead to a decrease in $\sigma(\tau)$ with increasing $\tau$.
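
The signature-plot formula, Eq. (20.49), i.e. $\sigma^2(\tau) = \sigma^2(1)\,[1 + 2\sum_{u=1}^{\tau}(1 - u/\tau)\, C_r(u)]$, can be evaluated directly for any assumed autocorrelation function. In the Python sketch below, the exponentially decaying correlations used for the trending and mean-reverting cases are hypothetical examples, not empirical fits.

```python
def signature_variance(tau, sigma2_1, corr):
    """sigma^2(tau) for an integer lag tau; corr(u) is the return autocorrelation C_r(u), u >= 1."""
    total = sum((1.0 - u / tau) * corr(u) for u in range(1, tau + 1))
    return sigma2_1 * (1.0 + 2.0 * total)


def uncorrelated(u):
    return 0.0                      # pure random walk


def trending(u):
    return 0.1 * 0.5 ** u           # hypothetical positive correlations


def mean_reverting(u):
    return -0.1 * 0.5 ** u          # hypothetical negative correlations


flat = [signature_variance(tau, 1.0, uncorrelated) for tau in (1, 5, 20)]
up = [signature_variance(tau, 1.0, trending) for tau in (1, 5, 20)]
down = [signature_variance(tau, 1.0, mean_reverting) for tau in (1, 5, 20)]
```

As expected, `flat` stays at $\sigma^2(1)$, `up` increases with $\tau$ and `down` decreases with $\tau$.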


20.3.3 Volatility Signature Plots for Real Price Series 


Quite remarkably, the volatility signature plots of most liquid assets (stocks, futures, 
FX, ...) are nowadays almost flat for values of $\tau$ ranging from a few seconds to a
few months (beyond which it becomes dubious whether the statistical assumption of
stationarity still holds). For example, for the S&P500 E-mini futures contract, which is
one of the most liquid contracts in the world, $\sigma(\tau)$ only decreases by about 20% from
short time scales (seconds) to long time scales (weeks). For single stocks, however, 
some interesting deviations from a flat horizontal line can be detected, see Figure 20.2. 
The exact form of a volatility signature plot depends on the microstructural details of 
the underlying asset, but most liquid contracts in this market have a similar volatility 
signature plot. 


20.3.4 Heavy Tails 


An overwhelming body of empirical evidence from a vast array of financial instruments 
(including stocks, currencies, interest rates, commodities, and even implied volatility) 
shows that unconditional distributions of returns have fat tails, which decay as a power law 
for large arguments and are much heavier than the tails of a Gaussian distribution. 

On short time scales (between about a minute and a few hours), the empirical density 
function of returns r can be fit reasonably well by a Student’s t-distribution, see Figure 
20.3. Student’s t-distributions read 


ltu er 
Pr = — r( 3 ) (1+ es ) í (20.50) 
a f/m T (5) a?u i 


where $a$ is a parameter fixing the scale of $r$. Student’s t is such that $P(r)$ decays for large
$r$ as $|r|^{-1-\mu}$, where $\mu$ is the tail exponent. Empirically, the tail parameter $\mu$ is
consistently found to be around 3 for a wide variety of different markets (see Fig. 20.3), which
suggests some kind of universality in the mechanism leading to extreme returns. This universality
hints at the fact that fundamental factors are probably unimportant in determining the
amplitude of most large price jumps. Interestingly, many studies indeed suggest that large
price moves are often not associated with an identifiable piece of news that would rationally
explain wild valuation swings.

Figure 20.2 Average signature plot for the normalized returns of US stocks, where the x-axis is
in days. The data consists of the returns of 1725 US companies over the period 2012-2019 (2000
business days); returns are normalized by a one-year exponential estimate of their past volatility. To
a first approximation $\sigma^2(\tau)$ is independent of $\tau$. The signature plot allows us to see
deviations from this pure random walk behavior. One can see that stocks tend to mean-revert slightly
at short times ($\tau < 50$ days) and trend at longer times. The effect is stronger on the many low
liquidity stocks included in this dataset.

Figure 20.3 Empirical distribution of normalized daily stock returns compared with a Gaussian and
a Student’s t-distribution with $\mu = 3$ and the same variance. Same data as in Figure 20.2.
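As a quick numerical sanity check of Eq. (20.50), the following sketch evaluates the density using only the Python standard library, then verifies that it integrates to one and that its local log-log slope approaches $-(1+\mu)$ far in the tail. The function names, the integration cutoff and the grid step are illustrative choices, not taken from the text.

```python
import math


def student_t_pdf(r, mu=3.0, a=1.0):
    """Student's t density of Eq. (20.50) with tail exponent mu and scale a."""
    log_norm = (math.lgamma((1.0 + mu) / 2.0) - math.lgamma(mu / 2.0)
                - math.log(a) - 0.5 * math.log(math.pi * mu))
    return math.exp(log_norm - 0.5 * (1.0 + mu) * math.log1p(r * r / (a * a * mu)))


def total_mass(mu=3.0, a=1.0, cutoff=200.0, step=0.01):
    """Trapezoidal estimate of the integral of P(r) over [-cutoff, cutoff]."""
    n = int(round(2.0 * cutoff / step))
    total = 0.5 * (student_t_pdf(-cutoff, mu, a) + student_t_pdf(cutoff, mu, a))
    for k in range(1, n):
        total += student_t_pdf(-cutoff + k * step, mu, a)
    return total * step


def tail_slope(r, mu=3.0, a=1.0, eps=1e-3):
    """Finite-difference estimate of d log P / d log r at a point r > 0."""
    lo, hi = r * (1.0 - eps), r * (1.0 + eps)
    return ((math.log(student_t_pdf(hi, mu, a)) - math.log(student_t_pdf(lo, mu, a)))
            / (math.log(hi) - math.log(lo)))


mass = total_mass()           # close to 1, up to the small mass beyond the cutoff
slope = tail_slope(1000.0)    # close to -(1 + mu) = -4 for mu = 3
print(mass, slope)
```

The slope is measured on the density itself, so it converges to $-1-\mu$ rather than to the exponent $-\mu$ of the cumulative tail.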


20.3.5 Volatility Clustering


Although considering the unconditional distribution of returns is informative, it is also
somewhat misleading. Returns are in fact very far from being IID random variables,
although they are indeed nearly uncorrelated, as their (almost) flat signature plots demonstrate.
Therefore, returns are not simply independent random variables drawn from
the Student’s t-distribution. Such an IID model would predict that upon time aggregation the
distribution of returns would quickly converge to a Gaussian distribution on longer time
scales. Empirical data indicates that this is not the case, and that returns remain substantially
non-Gaussian on time scales up to weeks or even months.

The dynamics of financial markets is in fact highly intermittent, with periods of intense
activity intertwined with periods of relative calm. In intuitive terms, the volatility of financial
returns is itself a dynamic variable that changes over time with a broad distribution of
characteristic frequencies, a phenomenon called heteroskedasticity. In more formal terms,
returns can be represented by the product of a time dependent volatility component $\sigma_t$ and
an IID directional component $\varepsilon_t$,


$$r_t := \sigma_t\,\varepsilon_t. \tag{20.51}$$


In this representation, $\varepsilon_t$ are IID (but not necessarily Gaussian) random variables of unit
variance and $\sigma_t$ are positive random variables with long memory. This is illustrated in
Figure 20.4 where we show the autocorrelation of the squared returns, which gives access
to $\mathbb{E}[\sigma_t^2 \sigma_{t+\tau}^2]$.
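A toy simulation makes the decomposition of Eq. (20.51) concrete. The sketch below assumes Gaussian $\varepsilon_t$ and a log-volatility that follows an AR(1) process. This is a convenient short-memory stand-in, not the long-memory process observed empirically, and the parameters `phi` and `log_vol_std` are hypothetical. It illustrates that returns can be nearly uncorrelated while their squares are clearly correlated.

```python
import numpy as np


def simulate_returns(T=100_000, phi=0.98, log_vol_std=0.3, seed=0):
    """Simulate r_t = sigma_t * eps_t with an AR(1) log-volatility (toy assumption)."""
    rng = np.random.default_rng(seed)
    # Innovation scale chosen so that log(sigma_t) has stationary std log_vol_std.
    innov_std = log_vol_std * np.sqrt(1.0 - phi ** 2)
    shocks = rng.normal(0.0, innov_std, size=T)
    log_vol = np.empty(T)
    log_vol[0] = rng.normal(0.0, log_vol_std)
    for t in range(1, T):
        log_vol[t] = phi * log_vol[t - 1] + shocks[t]
    eps = rng.standard_normal(T)  # IID, unit variance, drawn independently of sigma_t
    return np.exp(log_vol) * eps


def autocorr(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


r = simulate_returns()
acf_r = autocorr(r, 1)              # returns: close to zero
acf_r2 = autocorr(r ** 2, 1)        # squared returns: clearly positive
acf_r2_far = autocorr(r ** 2, 50)   # still positive, but smaller, at a longer lag
print(acf_r, acf_r2, acf_r2_far)
```

In this toy model $\sigma_t$ and $\varepsilon_t$ are independent, so it does not reproduce the leverage effect.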

It is worth pointing out that volatilities $\sigma$ and scaled returns $\varepsilon$ are in fact not
independent random variables. It is well documented that positive past returns tend to decrease
future volatilities and that negative past returns tend to increase future volatilities (i.e.
$\mathbb{E}[\varepsilon_t \sigma_{t+\tau}] < 0$ for $\tau > 0$). This is called the leverage effect.
Importantly, however, past volatilities do not give much information on the sign of future returns
(i.e. $\mathbb{E}[\varepsilon_t \sigma_{t+\tau}] \approx 0$ for $\tau < 0$).

Figure 20.4 Average autocorrelation function of squared daily returns for the US stock data described
in Figure 20.3. The autocorrelation decays very slowly with the time difference $\tau$. A power law
$\tau^{-\gamma}$ with $\gamma = 0.5$ is plotted to guide the eye. Note that the three peaks at
$\tau = 65$, 130 and 195 business days correspond to the periodicity of highly volatile earnings
announcements.


20.4 Empirical Covariance Matrices 
20.4.1 Empirical Eigenvalue Spectrum 


We are now in a position to investigate the empirical covariance matrix $\mathbf{E}$ of a collection of
stocks. For definiteness, we choose $q = 0.25$ by selecting $N = 500$ stocks observed at the
daily time scale, with time series of length $T = 2000$ days. The distribution of eigenvalues
of $\mathbf{E}$ is shown in Figure 20.5. We observe a rather broad distribution centered around 1, but
with a slowly decaying tail and a top eigenvalue $\lambda_1$ found to be ~100 times larger than the
mean. The top eigenvector $\mathbf{v}_1$ corresponds to the dominant risk factor. It is closely aligned
with the uniform mode $[\mathbf{e}]_i = 1/\sqrt{N}$, i.e. all stocks moving in sync, hence the name
“market mode” to describe the top eigenvector. Numerically, one finds
$|\mathbf{v}_1^T \mathbf{e}| \approx 0.95$.


20.4.2 A One-Factor Model 


The simplest model aimed at describing the co-movement of different stocks is to assume
that the return $r_{i,t}$ of stock $i$ at time $t$ can be decomposed into a common factor $f_t$ and IID
idiosyncratic components $\varepsilon_{i,t}$, to wit,


$$r_{i,t} = \beta_i f_t + \varepsilon_{i,t}, \tag{20.52}$$


where $f_t$ is often thought of as the “market factor”. Assuming further that $f_t$ and
$\varepsilon_{i,t}$ are uncorrelated random variables of mean zero and variance, respectively,
$\sigma_f^2$ and $\sigma_\varepsilon^2$, the covariance matrix $C_{ij}$ is simply given by


$$C_{ij} = \beta_i \beta_j \sigma_f^2 + \delta_{ij}\,\sigma_\varepsilon^2. \tag{20.53}$$


Hence, the matrix $\mathbf{C}$ is equal to $\sigma_\varepsilon^2 \mathbf{1}$ plus a rank-1 perturbation
$\sigma_f^2 \boldsymbol{\beta}\boldsymbol{\beta}^T$. The eigenvalues are all equal to
$\sigma_\varepsilon^2$, save the largest one, equal to
$\sigma_\varepsilon^2 + \sigma_f^2 |\boldsymbol{\beta}|^2$. The corresponding top
eigenvector $\mathbf{u}_\beta$ is parallel to $\boldsymbol{\beta}$. When the $\beta_i$’s are not too far
from one another, this top eigenvector is aligned with the uniform vector $\mathbf{e}$, as found
empirically.

Figure 20.5 Eigenvalue distribution of the SCM, averaged over three random sets of 500 US stocks,
each measured on 2000 business days. Returns are normalized as in Figure 20.3, corresponding to
$\bar{\lambda} = 0.97$. The inset shows the complementary cumulative distribution for the largest
eigenvalues, indicating a power-law behavior for large $\lambda$, as
$P_>(\lambda) \propto \lambda^{-4/3}$. Note the largest eigenvalue $\lambda_1 \approx 0.2N$, which
corresponds to the “market mode”, i.e. the risk factor where all stocks move in the same direction.

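From the analysis conducted in Chapter 14, we know that the eigenvalues of the empirical
matrix corresponding to such a model are composed of a Marčenko–Pastur “sea”
between $\lambda_- = \sigma_\varepsilon^2 (1 - \sqrt{q})^2$ and
$\lambda_+ = \sigma_\varepsilon^2 (1 + \sqrt{q})^2$ and an outlier located at


$$\lambda_1 = \sigma_\varepsilon^2 \,(1 + a)\left(1 + \frac{q}{a}\right), \tag{20.54}$$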


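with $a = \sigma_f^2 |\boldsymbol{\beta}|^2 / \sigma_\varepsilon^2$, and provided $a > \sqrt{q}$ (see
Section 14.4). Since $|\boldsymbol{\beta}|^2 = O(N)$, the last condition is easily satisfied for large
portfolios. When $a \gg 1$, one thus finds


$$\lambda_1 \approx \sigma_f^2 |\boldsymbol{\beta}|^2. \tag{20.55}$$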


Since empirically $\lambda_1 \approx 0.2N$ for the correlation matrix for which
$\mathrm{Tr}\,\mathbf{C} = N$, we deduce that for that normalization
$\sigma_\varepsilon^2 \approx 0.8$. The Marčenko–Pastur sea for the value of $q = 1/4$ used in
Figure 20.5 should thus extend between $\lambda_- \approx 0.2$ and $\lambda_+ \approx 1.8$. Figure
20.5 however reveals that ~20 eigenvalues lie beyond $\lambda_+$, a clear sign that more factors
are needed to describe the co-movement of stocks. This is expected: the industrial sector 
(energy, financial, technology, etc.) to which a given stock belongs is bound to have some 
influence on its returns as well. 
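The one-factor picture can be checked on synthetic data. The sketch below assumes Gaussian factor and residuals, uniform exposures $\beta_i = 1$, and the values $\sigma_f^2 = 0.2$, $\sigma_\varepsilon^2 = 0.8$ suggested by the estimate above. The sizes $N = 200$, $T = 800$ (so $q = 1/4$) and all function names are illustrative choices. It compares the top sample eigenvalue with Eq. (20.54) and the second one with the edge $\lambda_+$.

```python
import numpy as np


def one_factor_spectrum(N=200, T=800, sigma_f2=0.2, sigma_eps2=0.8, seed=1):
    """Eigen-decomposition of the sample covariance of a simulated one-factor model."""
    rng = np.random.default_rng(seed)
    beta = np.ones(N)                                         # uniform exposures (assumption)
    f = rng.normal(0.0, np.sqrt(sigma_f2), size=T)            # common factor f_t
    eps = rng.normal(0.0, np.sqrt(sigma_eps2), size=(N, T))   # idiosyncratic components
    r = np.outer(beta, f) + eps                               # r_{i,t} = beta_i f_t + eps_{i,t}
    E = r @ r.T / T                                           # sample covariance (zero mean)
    eigvals, eigvecs = np.linalg.eigh(E)                      # eigenvalues in ascending order
    return eigvals, eigvecs, beta


def predicted_outlier(N, T, beta, sigma_f2, sigma_eps2):
    """Outlier location from Eq. (20.54), valid when a > sqrt(q)."""
    q = N / T
    a = sigma_f2 * np.dot(beta, beta) / sigma_eps2
    return sigma_eps2 * (1.0 + a) * (1.0 + q / a)


N, T = 200, 800
eigvals, eigvecs, beta = one_factor_spectrum(N, T)
lam_top = eigvals[-1]
lam_pred = predicted_outlier(N, T, beta, 0.2, 0.8)
lam_plus = 0.8 * (1.0 + np.sqrt(N / T)) ** 2               # upper edge of the bulk
overlap = abs(eigvecs[:, -1] @ np.ones(N)) / np.sqrt(N)    # |v_1 . e|
print(lam_top, lam_pred, eigvals[-2], lam_plus, overlap)
```

With a single factor, only one eigenvalue leaves the bulk, in contrast with the ~20 outliers seen in the empirical spectrum.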


20.4.3 The Rotationally Invariant Estimator for Stocks 


We now determine the optimal RIE corresponding to the empirical spectrum shown in 
Figure 20.5. As explained in Chapter 19, there are two possible ways to do this. One is 
to use Eq. (19.44) with an appropriately regularized empirical Stieltjes transform, for
example by adding a small imaginary part to $\lambda$ equal to $N^{-1/2}$. The second is to use a
cross-validation method, see Eq. (19.88), which is theoretically equivalent as we have shown in
Section 19.6. The two methods are compared in Figure 20.6, and agree quite well provided
one chooses a slightly higher, effective value $q^*$, so as to mimic the effect of temporal
correlations and fluctuating variance that lead to an effective reduction of the size of the
sample (see the discussion in Section 17.2.3).

The shape of the non-linear function $\xi(\lambda)$ is interesting. It is broadly in line with the
inverse-Wishart toy model shown in Figure 19.2: $\xi(\lambda)$ is concave for small $\lambda$, becomes
approximately linear within the Marčenko–Pastur region, and becomes convex for larger $\lambda$.
For very large $\lambda$, however, $\xi(\lambda)$ becomes linear again (not shown), in a way compatible
with the general formula Eq. (19.57). The use of RIE for portfolio construction in real
applications is discussed in the references below.

Figure 20.6 Non-linear shrinkage function $\xi(\lambda)$ computed using cross-validation and RIE
averaged over three datasets. Each dataset consists of 500 US stocks measured over 2000 business days.
Cross-validation is computed by removing a block of 100 days (20 times) to compute the out-of-sample
variance of each eigenvector (see Eq. (19.88)). RIE is computed using the sample Stieltjes transform
evaluated with an imaginary part $\eta = N^{-1/2}$. Results are shown for $q = N/T = 1/4$ and also for
$q^* = 1/3$, chosen to mimic the effects of temporal correlations and fluctuating variance that lead
to an effective reduction of the size of the sample (cf. Section 17.2.3). All three curves have been
regularized through an isotonic fit, i.e. a fit that respects the monotonicity of the function.


Appendix


Mathematical Tools 


A.1 Saddle Point Method 


In this appendix, we briefly review the saddle point method (sometimes also called the 
Laplace method, the steepest descent or the stationary phase approximation). Consider the 
integral 


$$I = \int_{-\infty}^{+\infty} e^{tF(x)}\,dx. \tag{A.1}$$


We want to find an approximation for this integral when $t \to \infty$. First consider the
case where $F(x)$ is real. The key idea of the Laplace method is that when $t$ is large $I$
is dominated by the maximum of $F(x)$ plus Gaussian fluctuations around it. Suppose $F$
reaches its maximum at a unique point $x^*$; then around $x^*$ we have


$$F(x) = F(x^*) + \frac{F''(x^*)}{2}\,(x - x^*)^2 + O(|x - x^*|^3), \tag{A.2}$$


where $F''(x^*) < 0$. Thus for large $t$, we have, after a Gaussian integral over $x - x^*$,


$$I \sim \sqrt{\frac{2\pi}{-F''(x^*)\,t}}\; e^{tF(x^*)}, \tag{A.3}$$


where the symbol $\sim$ means that the ratio of both sides of the equation tends to 1 as
$t \to \infty$. Often we are only interested in


$$\lim_{t\to\infty} \frac{1}{t}\log I = F(x^*), \tag{A.4}$$


in which case we do not need to compute the prefactor. 
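To see how quickly Eq. (A.3) becomes accurate, one can compare it with direct numerical quadrature on a simple case. The sketch below uses $F(x) = -\cosh(x)$, an illustrative choice that does not appear in the text, for which $x^* = 0$, $F(x^*) = -1$ and $F''(x^*) = -1$. The integration window and step are likewise assumptions.

```python
import math


def laplace_ratio(t, half_width=3.0, step=1e-3):
    """Ratio of I = int exp(t F(x)) dx to its estimate (A.3), for F(x) = -cosh(x)."""
    # Maximum at x* = 0 with F(x*) = -1 and F''(x*) = -1.
    f_star, f_second = -1.0, -1.0
    n = int(round(2.0 * half_width / step))
    # Trapezoidal rule for int exp(t (F(x) - F(x*))) dx. The common factor
    # exp(t F(x*)) cancels in the ratio, which avoids numerical underflow.
    total = 0.0
    for k in range(n + 1):
        x = -half_width + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * math.exp(t * (-math.cosh(x) - f_star))
    exact_scaled = total * step
    gaussian_scaled = math.sqrt(2.0 * math.pi / (-f_second * t))
    return exact_scaled / gaussian_scaled


ratios = {t: laplace_ratio(t) for t in (5.0, 50.0, 500.0)}
print(ratios)  # the ratios approach 1 as t grows
```

The deviation of the ratio from 1 shrinks roughly like $1/t$, which is the size of the first correction neglected in Eq. (A.3).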

Things are more subtle when $F(z)$ is a complex analytic function of $z = x + iy$. One
could be tempted to think that one can just apply the Laplace method to $\mathrm{Re}(F(x))$. But this
is grossly wrong, because $\exp(it\,\mathrm{Im}(F(x)))$ oscillates so fast that the contribution of any
small neighborhood of $x^*$ is killed.

The idea of the steepest descent or the stationary phase approximation relies on the fact
that, for any analytic function, the Cauchy–Riemann condition ensures that


$$\nabla \mathrm{Re}(F(z)) \cdot \nabla \mathrm{Im}(F(z)) = 0, \tag{A.5}$$

where the gradient is in the two-dimensional complex plane. This means that the contour
lines of $\mathrm{Re}(F(z))$ are everywhere orthogonal to the contour lines of $\mathrm{Im}(F(z))$, or
alternatively that along the lines of steepest ascent (or descent) of $\mathrm{Re}(F(z))$, the imaginary
part $\mathrm{Im}(F(z))$ is constant.

Now, one can always deform the integration contour from the real line in Eq. (A.1)
to any curve in the complex plane that starts at $z = -\infty + i0$ and ends at $z = +\infty + i0$.
Since $\exp(tF(z))$ has no poles, this (by Cauchy’s theorem) does not change the value of the
integral. But again because of the Cauchy–Riemann condition, $\nabla^2 \mathrm{Re}(F(z)) = 0$, which
means that $\mathrm{Re}(F(z))$ has no maximum or minimum in the complex plane, but may have
a saddle point $z^*$ where $\nabla \mathrm{Re}(F(z)) = 0$, increasing in one direction and decreasing in
another (see Fig. A.1). Choosing the contour that crosses the saddle point $z^*$ by following a
path that is locally orthogonal to the contour lines of $\mathrm{Re}(F(z))$ allows the phase
$\mathrm{Im}(F(z))$ to be stationary, hence avoiding the nullification of the integral by rapid
oscillations. The final result is then


$$I \sim \sqrt{\frac{2\pi}{-tF''(z^*)}}\; e^{tF(z^*)}. \tag{A.6}$$


The method is illustrated in Figure A.1 for the example of the Airy function, defined as


$$\mathrm{Ai}(t) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} dz\, e^{tF(z,t)}, \qquad F(z,t) := i\left(z + \frac{z^3}{3t}\right). \tag{A.7}$$


For a fixed, large positive $t$, the points for which $F'(z,t) = 0$ are given by
$z_\pm = \pm i\sqrt{t}$. The contour lines of $\mathrm{Re}(F(z,t))$ are plotted in Figure A.1. For
$\mathrm{Im}(z) < 0$, $\mathrm{Re}(F(z,t)) \to +\infty$ when $\mathrm{Re}(z) \to \pm\infty$, so one cannot
deform the contour in that direction without introducing enormous uncontrollable contributions. When
$\mathrm{Im}(z) > 0$, on the contrary, $\mathrm{Re}(F(z,t)) \to -\infty$ when
$\mathrm{Re}(z) \to \pm\infty$, so one can start at $z = -\infty + i0$, and travel upwards in the
complex plane to the line where $\mathrm{Im}(F(z,t)) = 0$ in a region where $\mathrm{Re}(F(z,t))$ is so
negative that there is no contribution to the integral from this part. Then one stays on the
iso-phase line $\mathrm{Im}(F(z,t)) = 0$ and climbs upwards to the saddle point
$z^* = z_+ = +i\sqrt{t}$, for which $F(z^*,t) = -2\sqrt{t}/3$. One then continues on the same
iso-phase line downwards, and closes the contour towards $z = +\infty + i0$. Hence, we conclude that


$$\lim_{t\to\infty} \log \mathrm{Ai}(t) = tF(z^*,t) = -\frac{2}{3}\,t^{3/2}. \tag{A.8}$$


A more precise expression, including the prefactor in Eq. (A.6), is written


$$\mathrm{Ai}(t) \sim \frac{1}{2\sqrt{\pi}\,t^{1/4}}\; e^{-\frac{2}{3}t^{3/2}}. \tag{A.9}$$


Figure A.1 Contour lines of the real part of $F(z,t) = i(z + z^3/3t)$ for $t = 16$ (black lines). The
iso-phase line $\mathrm{Im}\,F(z,t) = 0$ is also shown (gray lines). The black circle and square are the
two solutions $z_\pm$ of $F'(z,t) = 0$. The relevant saddle is $z^* = +4i$, for which
$F(z^*,t) = -8/3$.


Exercise A.1.1 Saddle point method for the factorial function: Stirling’s approximation
We are going to estimate the factorial function for large arguments using an
integral representation and the saddle point approximation:


$$n! = \Gamma[n+1] = \int_0^{\infty} x^n e^{-x}\,dx. \tag{A.10}$$


(a) Write $n!$ in the form Eq. (A.1) for some function $F(x)$.

(b) Show that $x^* = n$ is the solution to $F'(x) = 0$.

(c) Let $I_0(n) = nF(x^*(n))$ be an approximation of $\log(n!)$. Compare this
approximation to the exact value for $n = 10$ and 100.

(d) Include the Gaussian corrections to the saddle point: Let $I_1(n) = \log(I)$ where
$I$ is given by Eq. (A.3) for your function $F(x)$. Show that


$$I_1(n) = n\log(n) - n + \frac{1}{2}\log(2\pi n). \tag{A.11}$$


(e) Compare $I_1(n)$ and $\log(n!)$ for $n = 10$ and 100.


A.2 Tricomi’s Formula


Suppose one has the following integral equation to solve for $\rho(x)$:


$$f(x) = \mathrm{P.V.}\!\int dx'\, \frac{\rho(x')}{x - x'}, \tag{A.12}$$

where f(x) is an arbitrary function. This problem arises in many contexts, in particular 
when one studies the equilibrium density of the eigenvalues of some random matrices as in 
Section 5.5, or the equilibrium density of dislocations in solids, see Landau et al. [2012], 
Chapter 30, from which the material of the present appendix is derived. 

We will consider the case where some “hard walls” are present at $x = a$ and $x = b$,
confining the density to an interval smaller than its “natural” extension in the absence of
these walls. By continuity, we will also get the solution when these walls are exactly located
at these natural boundaries of the spectrum.¹


The general solution of Eq. (A.12) is given by Tricomi’s formula which reads


$$\rho(x) = -\frac{1}{\pi^2\sqrt{(x-a)(b-x)}}\left(\mathrm{P.V.}\!\int_a^b dx'\,\sqrt{(x'-a)(b-x')}\;\frac{f(x')}{x-x'} + C\right), \tag{A.13}$$


where $C$ is a constant to be determined by some conditions that $\rho(x)$ must fulfill.
The simplest case corresponds to $f(x) = 0$, i.e. no confining potential apart from the
walls. The solution then reads


$$\rho(x) = -\frac{C}{\pi^2\sqrt{(x-a)(b-x)}}. \tag{A.14}$$


The normalization of $\rho(x)$ then yields $C = -\pi$ and one recovers the arcsine law encountered
in Sections 5.5, 7.2 and 15.3.1:


$$\rho(x) = \frac{1}{\pi\sqrt{(x-a)(b-x)}}. \tag{A.15}$$

The canonical Wigner case corresponds to $f(x) = x/2\sigma^2$. We look for the values of $a, b$
that are precisely such that the density vanishes at these points, so that the confining walls
no longer have any effect. By symmetry one must have $a = -b$. One then obtains


$$\rho(x) = -\frac{1}{2\pi^2\sigma^2\sqrt{(x+b)(b-x)}}\left(\mathrm{P.V.}\!\int_{-b}^{b} dx'\,\sqrt{(x'+b)(b-x')}\;\frac{x'-x+x}{x-x'} + C\right). \tag{A.16}$$


Using


$$x\;\mathrm{P.V.}\!\int_{-b}^{b} dx'\,\frac{\sqrt{(x'+b)(b-x')}}{x-x'} = \pi x^2, \tag{A.17}$$


one has


$$\rho(x) = -\frac{1}{2\pi^2\sigma^2\sqrt{(x+b)(b-x)}}\left(\pi x^2 + C'\right), \tag{A.18}$$


where $C' = C - \int_{-b}^{b} dx'\,\sqrt{(x'+b)(b-x')} = C - \pi b^2/2$.


For $\rho(x)$ to vanish at the edges, we need to choose $C' = -\pi b^2$, finally leading to


$$\rho(x) = \frac{\sqrt{b^2 - x^2}}{2\pi\sigma^2}. \tag{A.19}$$


Finally, for this distribution to be normalized one must choose $b = 2\sigma$, as it should be.


1 Formulas directly adapted to two free boundaries, or one free boundary and one hard wall, can be found in Landau et al. 
[2012], Chapter 30. 

The case studied by Dean and Majumdar, i.e. when all eigenvalues of a Wigner matrix
are constrained to be positive (Eq. (5.93)), corresponds to Eq. (A.13) with $a = 0$, $b > 0$
and $\sigma = 1$ (i.e. $f(x) = x/2$), with $b$ determined again such that $\rho(b) = 0$. The solution is
$b^* = 4/\sqrt{3}$ and


$$\rho(x) = \frac{1}{4\pi}\sqrt{\frac{b^* - x}{x}}\;(b^* + 2x). \tag{A.20}$$


Note that very generically the density of eigenvalues (or dislocations) diverges as $1/\sqrt{d}$
near a hard wall, where $d$ is the distance to that wall. When the boundary is free, on the
other hand, the density of eigenvalues (or dislocations) vanishes as $\sqrt{d}$.


A.3 Toeplitz and Circulant Matrices

A Toeplitz matrix is such that its elements $K_{ij}$ only depend on the “distance” between
indices $|i - j|$. An example of a Toeplitz matrix is provided by the covariance of the
elements of a stationary time series. For instance consider an AR(1) process $x_t$ in the steady
state, defined by


$$x_t = a\,x_{t-1} + \varepsilon_t, \tag{A.21}$$


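where $\varepsilon_t$ are IID centered random numbers with variance $1 - a^2$ such that $x_t$ has unit
variance in the steady state (stationarity requires
$\mathrm{Var}[x] = a^2\,\mathrm{Var}[x] + \mathrm{Var}[\varepsilon]$). Then


$$K_{ts} = \mathbb{E}[x_t x_s] = a^{|t-s|} \quad \text{with } 0 < a < 1, \tag{A.22}$$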


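describing decaying exponential correlations. The parameter $a$ measures the decay of the
correlation; we can define a correlation time $\tau_c := 1/(1 - a)$. This time is always $\geq 1$
(equal to 1 when $\mathbf{K} = \mathbf{1}$) and tends to infinity as $a \to 1$. More explicitly,
$\mathbf{K}$ is a $T \times T$ matrix that reads


$$\mathbf{K} = \begin{pmatrix}
1 & a & a^2 & \cdots & a^{T-2} & a^{T-1} \\
a & 1 & a & \cdots & a^{T-3} & a^{T-2} \\
a^2 & a & 1 & \cdots & a^{T-4} & a^{T-3} \\
\vdots & & & \ddots & & \vdots \\
a^{T-2} & a^{T-3} & a^{T-4} & \cdots & 1 & a \\
a^{T-1} & a^{T-2} & a^{T-3} & \cdots & a & 1
\end{pmatrix}. \tag{A.23}$$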

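In an infinite system, it can be diagonalized by plane waves (Fourier transform), since


$$\sum_{s=-\infty}^{+\infty} K_{ts}\,e^{2\pi i x s} = e^{2\pi i x t}\sum_{s=-\infty}^{+\infty} K_{ts}\,e^{2\pi i x (s-t)} = e^{2\pi i x t}\sum_{\ell=-\infty}^{+\infty} a^{|\ell|}\,e^{2\pi i x \ell}, \tag{A.24}$$


showing that $e^{2\pi i x s}$ are eigenvectors of $\mathbf{K}$ as soon as its elements only depend on
$|s - t|$.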

For finite $T$, however, this diagonalization is only approximate as there are “boundary
effects” at the edge of the matrix. If the correlation time $\tau_c$ is not too large (i.e. if the
matrix elements $K_{ts}$ decay sufficiently fast with $|t - s|$) these boundary effects should be
negligible. One way to make this diagonalization exact is to modify the matrix so as to have
the distance $|t - s|$ defined on a circle.² We define a new matrix $\tilde{\mathbf{K}}$ by


$$\tilde{K}_{ts} = a^{\min(|t-s|,\;|t-s+T|,\;|t-s-T|)}, \tag{A.25}$$

$$\tilde{\mathbf{K}} = \begin{pmatrix}
1 & a & a^2 & \cdots & a^2 & a \\
a & 1 & a & \cdots & a^3 & a^2 \\
a^2 & a & 1 & \cdots & a^4 & a^3 \\
\vdots & & & \ddots & & \vdots \\
a^2 & a^3 & a^4 & \cdots & 1 & a \\
a & a^2 & a^3 & \cdots & a & 1
\end{pmatrix}. \tag{A.26}$$


It may seem that we have greatly modified matrix $\mathbf{K}$ as we have changed about half of its
elements, but if $\tau_c \ll T$ most of the elements we have changed were essentially zero and
remain essentially zero. Only a finite number ($\sim 2\tau_c^2$) of elements in the top right and
bottom left corners have really changed. Changing a finite number of off-diagonal elements in
an asymptotically large matrix should not change its spectrum. The matrix $\tilde{\mathbf{K}}$, which we
will call $\mathbf{K}$ again, is called a circulant matrix and can be exactly diagonalized by Fourier
transform for finite $T$. More precisely, its eigenvectors are


$$[\mathbf{v}_k]_\ell = e^{2\pi i k \ell / T} \quad \text{for } 0 \leq k \leq T/2. \tag{A.27}$$


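Note that to each $\mathbf{v}_k$ correspond two eigenvectors, namely its real and imaginary parts,
except for $\mathbf{v}_0$ and $\mathbf{v}_{T/2}$ which are real and have multiplicity 1. The eigenvalues
associated with $k = 0$ and $k = T/2$ are, respectively, the largest ($\lambda_+$) and smallest
($\lambda_-$) and are given by


$$\lambda_+ = 1 + 2\sum_{\ell=1}^{T/2} a^\ell \approx \frac{1+a}{1-a}, \qquad \lambda_- = 1 + 2\sum_{\ell=1}^{T/2} (-a)^\ell \approx \frac{1-a}{1+a}. \tag{A.28}$$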


In terms of the correlation time: $\lambda_+ = 2\tau_c - 1$. We label the eigenvalues of $\mathbf{K}$ by
an index $x_k = 2k/T$ so that $0 \leq x_k \leq 1$. As $T \to \infty$, $x_k$ becomes a continuous
parameter $x$ and the different multiplicity of the first and the last eigenvalues does not matter.
The eigenvalues can be written


$$\lambda(x) = \frac{1 - a^2}{1 + a^2 - 2a\cos(\pi x)} \quad \text{for } 0 \leq x \leq 1. \tag{A.29}$$
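Eq. (A.29) is easy to check numerically. The sketch below builds the circulant matrix of Eq. (A.25) for illustrative values $T = 64$ and $a = 0.5$ (both are assumptions, as are the function names) and compares its eigenvalues with the formula evaluated at $\cos(\pi x_k) = \cos(2\pi k/T)$. Letting $k$ run over $0, \ldots, T-1$ visits each interior $x_k$ twice, which accounts for the two-fold multiplicity.

```python
import numpy as np


def circulant_ar1(T, a):
    """Circulant matrix with entries a**d, d being the index distance on a circle."""
    idx = np.arange(T)
    diff = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(diff, T - diff)   # same as min(|t-s|, |t-s+T|, |t-s-T|)
    return a ** dist.astype(float)


def spectrum_formula(T, a):
    """Eigenvalues (1 - a^2) / (1 + a^2 - 2 a cos(2 pi k / T)) for k = 0, ..., T-1."""
    theta = 2.0 * np.pi * np.arange(T) / T
    return (1.0 - a ** 2) / (1.0 + a ** 2 - 2.0 * a * np.cos(theta))


T, a = 64, 0.5
numeric = np.sort(np.linalg.eigvalsh(circulant_ar1(T, a)))
analytic = np.sort(spectrum_formula(T, a))
max_err = float(np.max(np.abs(numeric - analytic)))
print(max_err, numeric[-1], numeric[0])
```

For these values the discrepancy is of order $a^{T/2}$, and the extreme eigenvalues are close to $\lambda_+ = (1+a)/(1-a) = 3$ and $\lambda_- = (1-a)/(1+a) = 1/3$.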


2 The Toeplitz matrix K can in fact be diagonalized exactly, see O. Narayan, B. Sriram Shastry, arXiv:2006.15436v2. 

For a more general form $K_{ts} = K(|t - s|)$, the eigenvalues read


$$\lambda(x) = 1 + 2\sum_{\ell=1}^{\infty} K(\ell)\cos(\pi x \ell). \tag{A.30}$$


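The T-transform of $\mathbf{K}$ can then be computed as


$$t_{\mathbf{K}}(z) = \int_0^1 \frac{(1 - a^2)\,dx}{z\,(1 + a^2 - 2a\cos(\pi x)) - (1 - a^2)}. \tag{A.31}$$

Using

$$\int_0^1 \frac{dx}{c - d\cos(\pi x)} = \frac{1}{\sqrt{c - d}\,\sqrt{c + d}}, \tag{A.32}$$

and after some manipulations we find

$$t_{\mathbf{K}}(z) = \frac{1}{\sqrt{z - \lambda_-}\,\sqrt{z - \lambda_+}}, \qquad \lambda_\pm = \frac{1 \pm a}{1 \mp a}. \tag{A.33}$$
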
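We can also deduce the density of eigenvalues (see Fig. A.2):

$$\rho_{\mathbf{K}}(\lambda) = \frac{1}{\pi\lambda\sqrt{(\lambda - \lambda_-)(\lambda_+ - \lambda)}} \quad \text{for } \lambda_- < \lambda < \lambda_+. \tag{A.34}$$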


This density has integrable singularities at $\lambda = \lambda_\pm$. It is normalized and its mean is
1. We can also invert $t_{\mathbf{K}}(z)$ with the equation


$$t^2\zeta^2 - 2b\,t^2\zeta + t^2 - 1 = 0, \quad \text{where } b = \frac{1 + a^2}{1 - a^2}, \tag{A.35}$$


and get

$$\zeta_{\mathbf{K}}(t) = \frac{b\,t^2 + \sqrt{(b^2 - 1)\,t^4 + t^2}}{t^2}, \tag{A.36}$$

so the S-transform is given by

$$S_{\mathbf{K}}(t) = \frac{t + 1}{\sqrt{1 + (b^2 - 1)\,t^2} + b\,t} = 1 - (b - 1)\,t + O(t^2), \tag{A.37}$$


where the last equality tells us that the matrix $\mathbf{K}$ has mean 1 and variance
$\sigma_{\mathbf{K}}^2 = b - 1 = 2a^2/(1 - a^2)$.

Figure A.2 Density of eigenvalues for the decaying exponential covariance matrix $\mathbf{K}$ for three
values of $a$: 0.25, 0.5 and 0.75.